A Review of Safe Reinforcement Learning Methods for Modern Power Systems

Tong Su, Tong Wu, Junbo Zhao, Anna Scaglione, Le Xie

This work is supported by the U.S. Department of Energy Solar Energy Technologies Office under award 37770. Tong Su and Junbo Zhao are with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269, USA (e-mail: [email protected]; [email protected]). Tong Wu and Anna Scaglione are with the Department of Electrical and Computer Engineering, Cornell Tech, Cornell University, New York City, NY 10044, USA (e-mail: [email protected]; [email protected]). Le Xie is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA (e-mail: [email protected]).

Abstract

Due to the availability of more comprehensive measurement data in modern power systems, there has been significant interest in developing and applying reinforcement learning (RL) methods for operation and control. Conventional RL training is based on trial-and-error and reward feedback interaction with either a model-based simulated environment or a data-driven and model-free simulation environment. These methods often lead to the exploration of actions in unsafe regions of operation and, after training, the execution of unsafe actions when the RL policies are deployed in real power systems. A large body of literature has proposed safe RL strategies to prevent unsafe training policies. In power systems, safe RL represents a class of RL algorithms that can ensure or promote the safety of power system operations by executing safe actions while optimizing the objective function. While different papers handle the safety constraints differently, the overarching goal of safe RL methods is to determine how to train policies to satisfy safety constraints while maximizing rewards. This paper provides a comprehensive review of safe RL techniques and their applications in different power system operations and control, including optimal power generation dispatch, voltage control, stability control, electric vehicle (EV) charging control, buildings’ energy management, electricity market, system restoration, and unit commitment and reserve scheduling. Additionally, the paper discusses benchmarks, challenges, and future directions for safe RL research in power systems.

Index Terms:

Safe reinforcement learning, machine learning, power system operation, power system control, energy management, optimal power generation dispatch, EV charging, voltage control.

Nomenclature

Notations

$\gamma$: Discount factor $\gamma\in[0,1)$
$\Delta$: Difference operator
$\delta$: Rotor angle
$\epsilon/A$: Inertia parameter of temperature and thermal conductivity of HVAC
$\varepsilon$: Safety constraint bound
$\zeta$: Safety probability ( $1-\zeta$ is the risk probability)
$\eta,\eta^{\text{CHP}}_{p/h}$: Efficiency of charging or discharging, electrical/thermal energy efficiency of CHP
$\theta$: Parameters of the policy $\pi_{\theta}$
$\vartheta$: Grid state in the DC-PF approximation
$\bm{\Lambda}^{\text{EV}}_{\text{ch/dis}}$: Charging/selling electricity price of EV
$\bm{\Lambda}^{\text{Ele/Gas/Car}}$: Price of electricity/gas/carbon
$\lambda$: Penalty coefficient or Lagrange multiplier
$\Pi_{S}$: Policy set
$\pi_{\theta}$ , $\pi_{\theta}^{\text{adv}}$: Parameterized policy, policy of adversary
$\pi_{\theta}^{k}$ , $\pi_{\theta}^{k+\frac{1}{2}}$: Policy at iteration $k$ , intermediate policy between iterations $k$ and $k+1$
$\rho_{0}$: $\rho_{0}:\mathcal{S}\rightarrow[0,1]$ is starting state distribution of $\mathcal{S}$
$\tau$: Trajectory $\tau=(s_{0},a_{0},s_{1},\ldots)$
$\bm{\omega}$: Frequency
$\mathcal{A},\bm{a}$: Action set, action
$a^{\text{SG}}/b^{\text{SG}}/c^{\text{SG}}$: Fuel cost coefficients of SG
$\mathcal{B}/\mathcal{G}/\mathcal{R}$: BESS/SG/RES set
$\mathcal{C},C$: Constraint set $\mathcal{C}=\{(C_{i},\varepsilon_{i})\}^{m}_{i=1}$ , constraint cost function $C:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\textbf{R}$
$c^{\text{RES/BESS}}$: Cost coefficients of RES/BESS
$\text{ch}/\text{dis}$: Charging/discharging of electricity or thermal for ESS
$\mathbb{D}$: Function to extract the vector of diagonal elements from a matrix
$M,L,\frac{1}{R},D$: Inertia constant, load damping coefficient, speed droop response coefficient, $D=\frac{1}{R}+L$ is the combined frequency response coefficient from synchronous generators and load
$\mathbb{E},E,E_{\text{cap}}$: Expectation function, energy associated with devices, energy capacity of ESS
$\mathcal{E}/\mathcal{N}$: Edge/node set
$f,g,h$: State transition dynamics or the model of the environment, equality constraints with a total number of $m$ , inequality constraints with a total number of $n$ .
$G/N$: Cardinality of the set $\cal G/\cal N$
$\bm{g}$: Gas input of CHP or GB
$\mathcal{H}/*$: Hermitian/conjugate for a vector or matrix
$\bm{h}$: Thermal energy generation or load vector
$\bm{i}$: Current phasor vector
$\mathcal{J}_{R}^{\pi_{\theta}}$ , $\mathcal{J}_{h_{i}}^{\pi_{\theta}}$: Reward performance, constraint cost performance of inequality constraints
$\mathcal{L}$: Lagrangian
$\mathcal{M}$ , $\mathcal{M}_{C}$: MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$ , CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$
$\mathbb{P},\mathcal{P}$: Probability function, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is the transition matrix, where $\mathcal{P}(s_{t+1}|s_{t},a_{t})$ denotes the probability of state transition from $s_{t}$ to $s_{t+1}$ after taking action $a_{t}$
$P^{\text{Load}}_{\text{his/pre}}$: Historical/current net load forecast
$P_{\text{res}}$: Reserve requirement
$\bm{p}/\bm{q}$: Active/reactive power generation or load vector
$\overline{\bm{p}}^{\text{Gen}}_{e}$: Maximum emergency power generation of generator
$\bm{p}^{\text{Bus}}$: Bus power injection
$p_{ij}/q_{ij}/s_{ij}$: Active/reactive/apparent power for branch $ij$
$\bm{p}_{e}/\bm{p}_{m}$: Electrical/mechanical power
$R$: Reward function $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$
$\bm{R}_{\text{up/down}}$: Ramp-up/down rate of generators
$r_{ij}/x_{ij}$: Resistance/reactance of line $ij$
$\mathcal{S},\bm{s}_{\text{ap}},\bm{s}$: State set, apparent power vector, state
$\bm{S}_{\text{up/down}}$: Start-up/shut-down rate of generators
$\mathcal{T},t$: Time step set of trajectory $\tau$ , time instant
$\overline{t}_{\text{up}}/\underline{t}_{\text{up}},t_{\text{tot}}$: Maximum/minimum up time of Gens, total time
$T,H,T^{I/O}$: Temperature, humidity, indoor/outdoor temperature
$\bm{u}_{\text{start/shut/com}}$: Startup/shutdown/commitment status of Gens
$\bm{v}/\bm{\phi}$: Voltage phasor/phase vector $\bm{v}_{t}=|\bm{v}|\odot e^{\mathfrak{j}\bm{\phi}}$
$\mathbf{Y}/\mathbf{B}$: Admittance/susceptance matrix
$\overline{\ }/\underline{\ }$: Maximum/minimum values of the variable or vector

Abbreviations

AC/DC: Alternating current/direct current
ADN: Active Distribution Network
AMI: Advanced Metering Infrastructure
(B/M/T)ESS: (Battery/Mobile/Thermal) Energy Storage System
CHP: Combined Heat and Power system
(C)MDP: (Constrained) Markov Decision Process
CPO: Constrained Policy Optimization
CPPO: Constraint-controlled PPO
CS: Charging Station
CUP: Conservative Update Policy
DDPG: Deep Deterministic Policy Gradient
DG: Distributed Generation
DER: Distributed Energy Resource
(D/R)NN: (Deep/Recurrent) Neural Network
DSO: Distribution System Operator
(D/R)RL: (Deep/Robust) Reinforcement Learning
EHP: Electric Heat Pump
EV: Electric Vehicle
FACTS: Flexible AC Transmission System
FOCOPS: First Order Constrained Optimization in Policy Space
GCN: Graph Convolution Network
GB: Gas Boiler
Gen: Generator
GP: Gaussian Process
GPT: Generative Pre-trained Transformer
HVAC: Heating, Ventilation and Air-Conditioning
ICNN: Input Convex Neural Network
IPO: Interior-point Policy Optimization
Lag: Lagrangian methods
LLM: Large Language Model
MA(C): Multi-Agent (Constrained)
MIP: Mixed-Integer Programming
MPPT: Maximum Power Point Tracking
PCPO: Projection-based Constrained Policy Optimization
PDO: Primal-Dual Optimization
PILCO: Probabilistic Inference for Learning Control
PMU: Phasor Measurement Unit
PPO: Proximal Policy Optimization
p.u.: per unit
RES: Renewable Energy Source
RCPO: Reward Constrained Policy Optimization
SAC: Soft Actor-Critic
SafePO: Safe Policy Optimization
(SC)(O)PF: (Security Constrained) (Optimal) Power Flow
SG: Synchronous Generator
SoC: State of Charge
TD3: Twin-Delayed Deep Deterministic policy gradient
TL: Thermal Load (such as room heater and water heater)
TR(PO/M): Trust Region (Policy Optimization/Method)
V2G: Vehicle-to-Grid
V, F: Voltage, Frequency

I Introduction

With the extensive integration of RESs, ESSs, and advanced power electronic devices, modern power systems are facing increased uncertainty and complexity, which translate to higher computational burden when modeling the stochastic non-linear nature of the control and decision problems. However, thanks to the widespread deployment of smart sensors, such as PMUs, along with advanced communication technologies, a vast amount of power system data can be measured and utilized for state estimation and control. As a result, data-driven approaches like RL have emerged as the key candidates for the numerical optimization of power systems decision and/or control policies[1], which would be otherwise intractable to derive. Conventionally, RL training is based on trial-and-error and reward feedback interaction with a model-based simulated environment [2] or a data-driven model-free simulated environment [3]. Recently, DRL, which embeds NNs as the policy function, has proven expressive enough to solve complicated control tasks. Additionally, the NN approach is used to reduce computation costs for online implementation. Once the NNs are trained, they approximate closed-form solutions and produce results quickly. However, nothing prevents the exploration of unsafe ranges during training and the execution of unsafe actions when the trained policies are deployed in real power systems. Therefore, the practical application of RL policies cannot be based on vanilla RL training [4].

In 2015, safe RL was first defined as “the process of learning policies that maximize the expectation of the reward in problems, where it is crucial to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes” [5]. Since then, safe RL has received increasing attention in the literature. The methods can be coarsely divided into two categories: those that add to the reward function a safety factor penalizing safety violations, and those that modify the exploration process during training by incorporating mechanisms that yield safe policies [5]. Based on these two approaches, numerous safe RL methods have been proposed, and many have been applied and tailored to power system decision and control problems, such as energy management, optimal power generation dispatch, EV charging, voltage control, and others covered in Section IV.

Reference [6] is currently the only paper that provides an overview of safe RL applications. However, the field is evolving fast, and we aim to provide first a comprehensive review of safe RL techniques in general and then a deep dive into their applications in power systems. The main contributions of the paper are as follows:

1. This paper provides a comprehensive review of safe RL, covering its fundamental concepts, constraint classifications, existing algorithms, and benchmarks. It details the unique features and limitations of each safe RL algorithm, providing a foundation for future research endeavors in the domain of safe RL.

2. It provides a comprehensive review of the application of safe RL in power systems, covering almost all existing papers in this area. It categorizes these papers based on their application domains, listing each paper’s objectives, constraints, implemented safe RL techniques, environment types, and key features.

3. It explores the key challenges and future research opportunities in safe RL for applications within power systems.

The framework of this paper is shown in Fig. 1. The rest of the paper is organized as follows. Section II introduces the CMDP and constraints. Section III provides a detailed introduction and classification of safe RL. Section IV offers a comprehensive review and comparative analysis of safe RL applications in different fields within the power system. Challenges and outlook are discussed in Section V and finally, Section VI concludes the paper.

Figure 1: The framework of safe RL in power system application.

II Constrained Markov Decision Process

II-A Problem formulation

An MDP is defined by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$ whose elements are, respectively, the state space, action space, transition probability, reward function, initial state distribution (with $\bm{s}_{0}\sim\rho_{0}$), and discount factor. When the decision problem fits in an MDP, the objective is to determine the policy $\pi$ that maximizes the expected discounted reward $\mathcal{J}_{R}^{\pi_{\theta}}$ , i.e. [4, 7, 8]:

	$\displaystyle\mathcal{J}_{R}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]$	(1)

where $\tau\sim\pi$ indicates that the distribution over trajectories depends on the policy $\pi$ ; similarly $\bm{s}_{0}\sim\rho_{0}$ , $\bm{a}_{t}\sim\pi(\cdot|\bm{s}_{t})$ , $\bm{s}_{t+1}\sim\mathcal{P}(\cdot|\bm{s}_{t},\bm{a}_{t})$ . Even if the transition probabilities and reward function are fully known, this task is often intractable. The usual approach is therefore to learn the policy using some parametrization $\pi_{\theta}$ .

The CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A}_{t},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$ extends the standard MDP to address a frequent model variation: the case in which the action space ${\cal A}_{t}$ depends on the state, i.e. $\bm{s}_{t}\mapsto{\cal A}_{t}$ , either because changes in the environment affect which actions are safe or feasible, or because the action incurs a state-dependent cost that must stay below a threshold. This occurs in physical systems in which the boundary conditions, the state, and the laws of physics limit what is feasible, determine which operations are unsafe, and set how expensive a given agent action is. In a nutshell, what differentiates the various instances of CMDP from a conventional MDP is the class of constraints that characterize the action space as a function of the system dynamics, together with the specific engineering problem and context that define those constraints. In this review, we define the CMDP for power system problems as:

	$\displaystyle\max_{\pi_{\theta}\in\Pi_{S}}\mathcal{J}_{R}^{\pi_{\theta}}\quad\text{s.t.}\quad\bm{a}_{t}\text{ is feasible}$	(2)

where “ $\bm{a}_{t}$ is feasible” means not only that $\bm{a}_{t}$ lies within its upper and lower limits, but also that the resulting $\bm{s}_{t}$ falls within specified feasible sets. In power systems, the upper and lower bounds on $\bm{a}_{t}$ relate to the control ranges of various controllable devices, such as the power output of SGs, RESs, and ESSs, as well as the temperature setpoint of HVAC systems; they can typically be enforced by simply restricting the action space of RL. The requirement that $\bm{s}_{t}$ falls within specified feasible sets means that the state adheres to safe and stable operation constraints, such as boundary constraints on voltages, line flows, and building temperatures, as well as stability constraints on voltages, frequency, and rotor angles. Due to the highly non-linear and non-convex nature of power systems, obtaining a feasible $\bm{a}_{t}$ that guarantees a feasible $\bm{s}_{t}$ is challenging. This is also the main challenge of training safe RL.

II-B Constraints

II-B1 Instantaneous Constraints

Instantaneous constraints are prevalent in power systems. For instance, in the optimal power generation dispatch of power systems, we encounter constraints such as power flow, dynamic limitations associated with BESSs, voltage magnitude bounds, and power generation limits, as detailed in Section IV-A. Another instance is voltage control, which incorporates additional voltage droop control dynamics and stability constraints, described in Section IV-B. We also explore other examples such as stability control, EV charging control, and building energy management in Section IV. In general, these constraints can be expressed as follows:

	$\displaystyle\max_{\pi_{\theta}\in\Pi_{S}}\mathcal{J}_{R}^{\pi_{\theta}}$	(3)
	$\displaystyle\text{s.t.}~~g_{j}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})=0,~~j=1,\cdots,m$
	$\displaystyle\phantom{\text{s.t.}}~~h_{k}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq 0,~~k=1,\cdots,n$

where the control action must fulfill both the $m$ equality and $n$ inequality constraints. The terms $\bm{s}_{t}$ and $\bm{s}_{t+1}$ appear in these constraints to represent the time-varying bounds of $\bm{a}_{t}$ . Dynamical constraints are expressed in the same form.

II-B2 Cumulative Constraints

Cumulative constraints mandate that the sum or average of a specific cost signal remains within prescribed limits, calculated from the beginning of an event to the present time. Examples include total revenue and network throughput. These constraints are commonly applied in robot locomotion and manipulation, as discussed in [9]. Although several studies have attempted to adapt these constraints to power systems as a more flexible alternative to hard constraints, the application remains limited. For instance, [10] employs a discounted cumulative formulation in (4) to establish safety constraints in the management of distribution networks. In particular, they relax instantaneous constraints, such as voltage bounds, SoC bounds, and power quality, to a discounted cumulative formulation. Similarly, [11, 12] also utilize this approach. However, such constraints may not fully capture all safety requirements, though they do offer a partial enhancement of safety measures, providing some benefit over no constraints at all. The reason these studies do not consider instantaneous constraints is that cumulative relaxation offers a straightforward method to adapt constrained RL techniques, originally developed for robot locomotion and manipulation, to power systems. This approach not only simplifies implementation but also provides methodological insights that could potentially be extended to handle instantaneous constraints in future research.

To make the review more self-contained, we will review three kinds of cumulative constraints. In [13], the constraints for safe RL are divided into cumulative constraints and instantaneous constraints. For cumulative constraints, they are further categorized as discounted cumulative constraints (4), mean valued constraints (5), and probabilistic constraints (6). The discounted cumulative constraint is of the form:

	$\displaystyle\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]\leq\varepsilon_{i}$	(4)

where $\varepsilon_{i}$ is the limit for each cumulative constraint.

The mean valued constraint is of the form:

	$\displaystyle\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\frac{1}{t_{\text{tot}}}\sum_{t=0}^{t_{\text{tot}}-1}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]\leq\varepsilon_{i}$	(5)

where $t_{\text{tot}}$ is the total number of time steps in each trajectory.

The second group concerns the probability that the cumulative costs violate a constraint [13]. Probabilistic constraints are of the form:

	$\displaystyle\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{P}\left[\sum_{t}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq\varepsilon_{i}\right]\geq\zeta$	(6)

where $\varepsilon_{i}$ is the cumulative cost threshold for each trajectory and $\zeta\in(0,1)$ is the probability limit.
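As an illustration only, the three cumulative constraint quantities in (4), (5), and (6) can be estimated from sampled trajectories by Monte Carlo averaging. The sketch below assumes each trajectory is given as a list of per-step costs $h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})$ for a single constraint; the function names and the returned dictionary keys are hypothetical and do not come from any cited work.

```python
import numpy as np


def discounted_cost(costs, gamma):
    """Discounted sum of per-step costs for one trajectory (inner sum of Eq. (4))."""
    costs = np.asarray(costs, dtype=float)
    return float(np.sum(gamma ** np.arange(len(costs)) * costs))


def constraint_estimates(trajectories, gamma, eps):
    """Monte Carlo estimates of the left-hand sides of Eqs. (4), (5) and (6).

    trajectories: list of per-step cost sequences, one per sampled trajectory.
    eps: cumulative cost threshold used inside the probability in Eq. (6).
    """
    # Eq. (4): expectation of the discounted cumulative cost
    disc = np.mean([discounted_cost(c, gamma) for c in trajectories])
    # Eq. (5): expectation of the per-trajectory time-averaged cost
    mean = np.mean([np.mean(c) for c in trajectories])
    # Eq. (6): fraction of trajectories whose total cost stays within eps
    prob = np.mean([np.sum(c) <= eps for c in trajectories])
    return {"discounted": float(disc), "mean": float(mean), "prob_within": float(prob)}
```

A policy would then be judged against (4) or (5) by comparing the first two estimates with $\varepsilon_{i}$, and against (6) by comparing the third with $\zeta$.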

Here, it is important to emphasize again that in power systems, the majority of constraints must be satisfied at every instant, thus they are commonly implemented as instantaneous constraints. For example, [14] utilizes the expected discounted reward, whereas constraints related to branch power flow and security operations are treated as instantaneous constraints.

II-C Constraints in Power Systems: Overview

In power system applications, the classification of constraints into instantaneous and cumulative constraints is related to the required degree of constraint satisfaction and the safe RL algorithms used. Typically, bus balance equations, upper and lower power limits of various equipment, ESS capacity constraints, certain voltage amplitude constraints, and some stability constraints are considered hard constraints. Safe RL algorithms capable of ensuring the satisfaction of hard constraints include the projection method (Section III-B), Lyapunov method (Section III-C), shielding method (Section III-E), safety layer method (Section III-F), and barrier function method (Section III-G). For example, [15] uses the logarithmic barrier function to make $\mathcal{J}_{h_{i}}^{\pi_{\theta}}$ approach infinity when the voltage exceeds its bounds, thereby satisfying hard voltage constraints. Due to discrepancies between models and real systems, various uncertainties of RESs and loads, and algorithmic shortcomings, constraints that are satisfied in theory may still not be guaranteed in actual deployment. Therefore, GP methods (Section III-D) and RRL (Section III-H) have been proposed, using the probabilistic/chance constraint (6). However, their application in power systems remains underexplored. A more common approach is to use constrained game-theoretic RL within RRL [14, 16]. Furthermore, by design some safe RL algorithms can only encourage constraint satisfaction while maximizing rewards. Such algorithms include Lagrangian relaxation (Section III-A) and penalty functions. For example, [17] uses the voltage constraint metric $\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\sum_{i\in\cal N}\max\left\{|\bm{v}_{i,t}-1|-0.05,0\right\}$ and employs Lagrangian relaxation for voltage control, which cannot guarantee absolute adherence to voltage constraints and is therefore classified as a soft constraint.
For some constraints, instead, such as user satisfaction with EV charging and voltage control at certain nodes, the goal is to approach standard values as closely as possible, making them inherently soft constraints. The illustrations of different constraints of safe RL are shown in Fig. 2.
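To make the soft voltage constraint concrete, the following minimal sketch evaluates a violation metric of the form quoted from [17], i.e. the summed excess of each bus voltage magnitude beyond a ±0.05 p.u. band around 1 p.u. The function name and the `band` argument are hypothetical, and voltages are assumed to be given as magnitudes in p.u.

```python
import numpy as np


def voltage_violation(v_pu, band=0.05):
    """Sum over buses of max(|v_i - 1| - band, 0), with v_i in p.u."""
    v = np.asarray(v_pu, dtype=float)
    # Only the part of the deviation that exceeds the band counts as violation
    return float(np.sum(np.maximum(np.abs(v - 1.0) - band, 0.0)))
```

For instance, bus voltages of 1.00, 1.07, and 0.93 p.u. give a metric of about 0.04, while any profile inside the band gives zero. Because the metric enters the objective only through a multiplier or penalty, a nonzero value can still occur after training, which is why such a constraint is soft.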

III Safe Reinforcement Learning

Safe RL is often formulated as a CMDP problem, where the objective is to maximize the reward of agents while ensuring that the agents satisfy safety constraints [18, 4]. Safe RL is categorized into different types from various perspectives. This section primarily categorizes these types based on the techniques used to ensure constraint satisfaction and provides detailed introductions of the techniques and benchmarks.

III-A Lagrangian Relaxation / Primal-Dual Method

Lagrangian relaxation, also known as the primal-dual method, is the most common technique in safe RL. The key idea of this method is to transform the CMDP problem into an unconstrained dual problem. This is achieved by employing adaptive Lagrange multipliers to penalize constraints [19]:


	$\displaystyle\textbf{Instantaneous}:~\min_{\lambda_{i}\geq 0}\max_{\theta}\mathcal{L}(\lambda_{i},\theta)=\min_{\lambda_{i}\geq 0}\max_{\theta}\left[\mathcal{J}_{R}^{\pi_{\theta}}-\sum_{i}\lambda_{i}\cdot h_{i}\right]$	(7a)
	$\displaystyle\textbf{Cumulative}:~\min_{\lambda_{i}\geq 0}\max_{\theta}\mathcal{L}(\lambda_{i},\theta)=\min_{\lambda_{i}\geq 0}\max_{\theta}\left[\mathcal{J}_{R}^{\pi_{\theta}}-\sum_{i}\lambda_{i}\cdot\left(\mathcal{J}_{h_{i}}^{\pi_{\theta}}-\varepsilon_{i}\right)\right]$	(7b)

The solution of (7) relies on Danskin’s theorem and convex analysis [20]. Due to its straightforward implementation and compatibility with both on-policy and off-policy methods, Lagrangian relaxation has been integrated with other RL algorithms, fostering the creation of numerous variants, such as DDPG-Lag, PPO-Lag, TRPO-Lag, TD3-Lag, SAC-Lag, MAPPO, RCPO, PDO, TRPO-PID, CPPO-PID, DDPG-PID, TD3-PID, SAC-PID [21, 22, 19, 23].
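The alternating structure of the min-max problem (7b) can be sketched on a toy scalar problem. The example below is an assumption-laden illustration, not any of the cited algorithms: it takes a hypothetical reward performance $\mathcal{J}_{R}(\theta)=-(\theta-2)^{2}$ and cost performance $\mathcal{J}_{h}(\theta)=\theta$ with known gradients, whereas practical methods such as PPO-Lag estimate these quantities from sampled trajectories.

```python
def primal_dual(eps=1.0, lr_theta=0.05, lr_lambda=0.05, iters=2000):
    """Gradient ascent on theta and projected descent on lambda for a toy Lagrangian.

    Toy problem (hypothetical): J_R(theta) = -(theta - 2)^2, J_h(theta) = theta,
    L(theta, lam) = J_R(theta) - lam * (J_h(theta) - eps).
    """
    theta, lam = 0.0, 0.0
    for _ in range(iters):
        # Primal step: ascend the Lagrangian in the policy parameter
        grad_theta = -2.0 * (theta - 2.0) - lam
        theta += lr_theta * grad_theta
        # Dual step: increase lam while the constraint is violated, keep lam >= 0
        lam = max(0.0, lam + lr_lambda * (theta - eps))
    return theta, lam
```

With the default settings the iterates approach $\theta=1$ and $\lambda=2$, the constrained optimum of this toy problem. The multiplier grows only while $\mathcal{J}_{h}>\varepsilon$, which is why constraint violations can occur during training before the multiplier has adapted.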

The Lagrangian relaxation method is the most commonly used approach in power systems, capable of being easily integrated with various algorithms for application across a wide range of domains. Based on instantaneous or hard constraints, [24] utilizes a primal-dual approach to optimize the control of power generation and BESS charging and discharging actions in a multi-stage real-time stochastic dynamic OPF. Additionally, [25] applies constrained SAC to the Volt-VAR control problem by synergistically combining the merits of the maximum-entropy framework, the method of multipliers, a device-decoupled neural network structure, and an ordinal encoding scheme. Furthermore, [26] employs constrained RL for the predictive control of OPF, paired with EV charging control. On the other hand, based on cumulative or soft constraints, [27] approximates the actor gradients by solving the Karush-Kuhn-Tucker conditions of the Lagrangian, instead of constructing reward critic networks and cost critic networks through interactions with the environment. Then, the interior point method is incorporated to derive the parameter updating rule for the DRL agent. Similarly, [28] develops a soft-constraint enforcement method to adaptively encourage the control policy in the safety direction with nonconservative control actions and find decisions with near-zero degrees of constraint violations.

III-B Projection Method / Trust Region Method

The TRM ensures constraint satisfaction at every step and enhances performance by updating the trust region policy gradient and projecting the policy into a safe feasible set during each iteration [29]. Typical projection methods include CPO [9], PCPO [30], FOCOPS [31], CUP [32], and MACPO[22], among which PCPO is implemented through a two-step process: first, conducting a local reward update, and then projecting the policy back onto the constraint set to address any constraint violations, as depicted in Fig. 3.
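The two-step structure described for PCPO can be sketched as follows. This is a simplified illustration under stated assumptions: the reward step is a plain gradient step rather than a trust-region update, the constraint is linearized around the current parameters $\theta^{k}$, and the projection uses the Euclidean norm, for which projecting onto a half-space has a closed form. All function and argument names are hypothetical.

```python
import numpy as np


def reward_step(theta_k, reward_grad, step=0.1):
    """Step 1: local reward improvement, giving the intermediate parameters."""
    return np.asarray(theta_k, float) + step * np.asarray(reward_grad, float)


def project_l2(theta_half, theta_k, cost_grad, cost_margin):
    """Step 2: Euclidean projection onto the linearized constraint set.

    The set is {theta : cost_grad @ (theta - theta_k) + cost_margin <= 0},
    where cost_margin = J_h(theta_k) - eps and cost_grad must be nonzero.
    """
    a = np.asarray(cost_grad, dtype=float)
    theta_half = np.asarray(theta_half, dtype=float)
    violation = a @ (theta_half - np.asarray(theta_k, float)) + cost_margin
    # Move back along the constraint gradient only if the half-space is violated
    scale = max(0.0, violation / (a @ a))
    return theta_half - scale * a
```

If the intermediate parameters already satisfy the linearized constraint, the projection leaves them unchanged; otherwise they are moved to the nearest point on the constraint boundary.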

In the power system domain, TRMs have also seen widespread application. For instance, [33] introduced a projection-embedded MA-DRL algorithm that smoothly and effectively restricts the DRL agent action space to prevent any violations of physical constraints, thereby achieving decentralized optimal control of distribution grids with a guaranteed 100% safety rate. Additionally, in the area of EV charging problems, [34] utilizes a penalty function to penalize the neural network output if it exceeds the action space and uses a projection operator to avoid incurring a negative reward when no EV is occupying the charging bay. In addition, [35] employs CPO for volt-VAR control to minimize the total operation costs while satisfying the physical operation constraints. However, TRMs, primarily based on TRPO or PPO, are not easily integrated with other RL types and are computationally intensive in high dimensions, limiting their suitability for large-scale safe RL problems [36].

III-C Lyapunov Method

Lyapunov functions, widely used in control engineering for controller design [37], were first applied to safe RL in [38]. The application of the Lyapunov method in power systems is limited because it requires prior knowledge of a Lyapunov function; if the model of the environmental dynamics is unknown, identifying a suitable Lyapunov function can be challenging. Existing applications nonetheless exist. Reference [39] integrates a Lyapunov function into the structural properties of primary frequency controllers, guaranteeing local asymptotic stability over a large set of states. Additionally, [40] utilizes Lyapunov theory to design a controller that satisfies specific Lipschitz constraints for decentralized inverter-based voltage control. Finally, [41] utilizes a stability-constrained RL method for real-time voltage control in distribution grids, providing a formal voltage stability guarantee using a Lyapunov function.

III-D Gaussian Process Method

GP [42] is widely utilized in numerous approaches to estimate uncertainty and identify unsafe areas. Consequently, assessments based on GP can be incorporated into the learning process to enhance agent safety [43]. GP-based safe RL algorithms include SafeOpt [44] and PILCO [45]. The application of GP method-based safe RL in power systems is limited, meriting further research to adequately address the various uncertainties inherent in power systems. The potential disadvantage of GP methods is their computational complexity and scalability issues, especially as the dimensionality of the problem space increases [36].

III-E Shielding Method

In [46], the shield is introduced for the first time in RL. This shield is computed in advance, based on the safety component of the system specification provided and an abstraction of the dynamics of the agent’s environment. It guarantees safety with minimal interference, implying that the shield limits the agent’s actions as little as necessary, only prohibiting actions that could jeopardize the safe behavior of the system. The shielded RL is shown in Fig. 4.

Shielding is a method that enforces constraint satisfaction, making it highly suitable for power system problems with hard constraints. For instance, in [47], actions that would lead to dangerous states, such as the SoC of BESSs being fully charged or depleted, are substituted by the shielding mechanism with safe actions to maintain system stability. Additionally, [48] combines a correction model adapted from gradient descent with the prediction model as a post-posed shielding mechanism to enforce safe actions in computer room air conditioning unit control problems. In addition, in unit commitment scheduling, [49] utilizes action space clipping to ensure that uncertainty estimates are reasonable and within appropriate bounds obtained from historical data. A potential drawback of the shielding method is the challenge of identifying feasible, safe actions based on infeasible ones, which requires underlying knowledge of the system. This can be difficult for certain complex systems or specific control scenarios [36].
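A shield of the kind described for BESS SoC limits can be sketched as a function that overrides the agent's proposed power only when the next SoC would leave its safe range. The battery model below is a deliberately simple assumption (a single efficiency $\eta$ for charging and discharging, constant power over the step), and the function name, sign convention, and default limits are hypothetical rather than taken from [47].

```python
def shield_bess_action(p_kw, soc, e_cap_kwh, dt_h=1.0,
                       soc_min=0.1, soc_max=0.9, eta=0.95):
    """Return the closest power setpoint that keeps the next SoC within limits.

    Sign convention (assumed): p_kw > 0 charges, p_kw < 0 discharges.
    Assumed SoC model: charging adds eta * p * dt / E_cap,
    discharging removes |p| * dt / (eta * E_cap).
    """
    # Largest charging power that does not overshoot soc_max
    p_max = (soc_max - soc) * e_cap_kwh / (eta * dt_h)
    # Most negative (discharging) power that does not undershoot soc_min
    p_min = (soc_min - soc) * e_cap_kwh * eta / dt_h
    # Minimal interference: safe actions pass through unchanged
    return min(max(p_kw, p_min), p_max)
```

The clipping to the interval between `p_min` and `p_max` reflects the minimal-interference principle: the shield modifies the action only when it would otherwise lead to an unsafe SoC.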

III-F Safety Layer Method

Both the safety layer and shielding methods integrate safety into the RL process, but they differ in their implementation: the safety layer acts as an additional check within the RL framework, whereas shielding employs an external system (the shield) that intervenes only when necessary to prevent unsafe actions. The safety layer method, first proposed in [50] for continuous action spaces in RL, emphasizes maintaining zero-constraint violations throughout the learning process. It expresses safety constraints as linear functions of the action through a first-order approximation. Assuming that at most one constraint is violated at any time, an analytical solution to the safety layer optimization problem can be directly obtained. The linearization equation and a visualization of the safety layer are shown in (8) and Fig. 5, respectively.

	$\displaystyle\overline{h}_{i}(s_{t+1})\triangleq h_{i}(s_{t},a_{t})\approx\overline{h}_{i}(s_{t})+g(s_{t};w_{i})^{T}a_{t}$	(8)

where $w_{i}$ are the weights of the NN and $g(s_{t};w_{i})$ denotes the first-order approximation to $h_{i}(s_{t},a_{t})$ with respect to $a_{t}$ .
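Under the linearization (8) and the assumption that at most one constraint is active, the correction applied by the safety layer has a closed form: the proposed action is shifted along the sensitivity vector of the most violated constraint. The sketch below assumes the sensitivities $g(s_{t};w_{i})$ have already been evaluated and stacked as rows of a matrix `G`; the function and argument names are hypothetical.

```python
import numpy as np


def safety_layer(action, h_now, G, limits=None):
    """Closed-form action correction for linearized constraints.

    Constraint i is predicted as h_now[i] + G[i] @ action and must stay
    at or below limits[i] (zero by default). Rows of G must be nonzero.
    """
    a = np.asarray(action, dtype=float)
    h_now = np.asarray(h_now, dtype=float)
    G = np.asarray(G, dtype=float)
    limits = np.zeros_like(h_now) if limits is None else np.asarray(limits, float)
    # Multiplier of each constraint if it were the only active one
    lam = np.maximum((G @ a + h_now - limits) / np.sum(G * G, axis=1), 0.0)
    # Correct along the most violated constraint only
    i = int(np.argmax(lam))
    return a - lam[i] * G[i]
```

When no constraint is predicted to be violated, all multipliers are zero and the action passes through unchanged. When several constraints are violated at once, correcting only one of them may leave the others violated, which is the limitation of the single-constraint assumption.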

The safety layer method has been widely applied in power systems. For example, in optimal power generation dispatch, [51] proposes a hybrid knowledge-data-driven safety layer to convert unsafe actions into the safety region, which is accelerated by a security-constrained linear projection model. Additionally, in volt-VAR control, [52] adds a safety layer to the policy neural network to enhance operational constraint satisfaction during both the initial exploration phase and the convergence phase. In addition, [53] uses action clipping, reward shaping, and expert demonstrations to ensure safe exploration and accelerate the training process during the online training stage to assist service restoration. However, the linear approximation in the safety layer might not accurately capture the complexities of the underlying dynamics in highly non-linear systems, and iterating at every time step could introduce a significant computational burden. Moreover, the assumption that at most one constraint is violated at a time may not be valid in complex environments where multiple safety constraints are concurrently active.

III-G Barrier Function Method

The barrier function method involves adding a barrier function penalty term to the original objective function. When the system state approaches the safety boundary, the value of the constructed barrier function tends to infinity, thereby ensuring that the state remains within the safe boundary [54]. The most typical barrier function method is IPO, which augments the objective with logarithmic barrier functions, drawing inspiration from the interior-point method [55]:


$\displaystyle\textbf{Instantaneous}:$	$\displaystyle~\max_{\theta}J_{R}^{\pi_{\theta}}+\sum_{i}\frac{1}{t_{i}}\log(-h_{i})$	(9a)
$\displaystyle\textbf{Cumulative}:$	$\displaystyle~\max_{\theta}J_{R}^{\pi_{\theta}}+\sum_{i}\frac{1}{t_{i}}\log(-J_{h_{i}}^{\pi_{\theta}}+\varepsilon_{i})$	(9b)

where $t_{i}$ is a hyperparameter for $h_{i}$ . The illustration of IPO is shown in Fig. 6.
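The cumulative form (9b) can be illustrated with a short sketch. It assumes that estimates of the reward return $J_{R}^{\pi_{\theta}}$ and the cost returns $J_{h_{i}}^{\pi_{\theta}}$ are already available (for example, from rollouts); the function name `ipo_objective` is hypothetical and the code only evaluates the augmented objective, without performing the policy update.

```python
import math


def ipo_objective(reward_return, cost_returns, limits, t):
    """Evaluate J_R + sum_i (1/t_i) * log(eps_i - J_{h_i}) as in (9b).

    Returns -inf when any constraint is violated or tight, because the
    logarithmic barrier is undefined outside the interior of the safe set.
    """
    total = reward_return
    for j_h, eps, t_i in zip(cost_returns, limits, t):
        slack = eps - j_h
        if slack <= 0.0:
            return -math.inf
        total += math.log(slack) / t_i
    return total
```

A larger $t_{i}$ makes the barrier term smaller in the interior, so the augmented objective approximates the original constrained problem more closely while still diverging at the boundary.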

The barrier function method and IPO have been widely applied in power systems to enforce safety constraints. For example, [12] utilizes IPO to ensure the fulfillment of distribution network constraints without the need for designated penalty terms and the associated tuning of penalty factors, or repeatedly solving optimization problems for action rectification. Additionally, [56] uses IPO to facilitate desirable learning behavior towards constraint satisfaction and policy improvement simultaneously during online preventive control for transmission overload relief. In addition, [57] proposes a safe RL method for emergency load shedding in power systems, where the reward function includes a barrier function that approaches negative infinity as the system state approaches safety bounds. However, the accurate formulation and tuning of barrier functions necessitate knowledge of system dynamics, which can be challenging in complex environments.

III-H Robust Reinforcement Learning

One of the challenges in RL is generalization under uncertainties not seen during training. To address this, RRL frameworks have been developed, focusing on enhancing the reliability and robustness of RL agents for the worst-case scenarios [58, 59]. Two notable approaches in this context are chance-constrained RRL and constrained game-theoretic RL. It is important to note that RRL is not universally recognized as a safe RL algorithm in other fields. However, due to the significant uncertainties in power systems, RRL is employed to enhance control robustness and is reviewed here.

III-H1 Chance-constrained RRL

Chance-constrained RRL, in particular, focuses on ensuring that policies perform well under uncertain conditions by incorporating probabilistic constraints into the learning process [60]. In this framework, the goal is not just to maximize expected rewards but to do so while ensuring that the probability of undesirable outcomes (e.g., safety violations) remains below a specified threshold [61]. This is particularly important in scenarios where safety and reliability are critical, such as autonomous driving or robotics [62]. The general form can be expressed as:

	$\displaystyle\max_{\pi}~\mathcal{J}_{R}^{\pi_{\theta}}$		
	$\displaystyle\text{s.t.}~~\mathbb{P}\left[\min_{i}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq\varepsilon_{i}\right]\geq\zeta,\forall t\in\mathcal{T}$		(10)

III-H2 Constrained game-theoretic RL

Constrained game-theoretic RL is a framework that models the interaction between the RL agent and its environment as a game, specifically focusing on scenarios where there are constraints that the agent must respect during the learning and decision-making processes [63]. The objective is to maximize the agent’s rewards while minimizing the possible losses or costs, considering the worst-case scenarios posed by adversaries’ actions or environmental uncertainties [64]. The problem can be expressed in a minimax optimization framework [63]:

	$\displaystyle\min_{\pi_{\theta}^{\text{adv}}}\max_{\pi_{\theta}}$	$\displaystyle~\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},a_{t}^{\text{adv}},s_{t+1})\right]$		
	$\displaystyle\text{s.t.}~~h_{i}(s_{t},a_{t},a_{t}^{\text{adv}},s_{t+1})\leq 0,\forall t\in\mathcal{T}$		(11)

One of the key benefits of constrained game-theoretic RL is its ability to handle competitive and cooperative interactions within complex environments, making it suitable for applications ranging from strategic games to cooperative multi-agent scenarios like mobile edge computing [65] and coordination in robotic teams [66].

RRL is applied in power systems to ensure that control strategies remain robust under various uncertainties. For example, [14] utilizes adversarial safe RL to address the model inaccuracy and uncertainty of virtual power plants without relying on an accurate environmental model. Additionally, in the sequential OPF problem, [51] employs a bi-level robust optimization approach to optimize the training loss of the Q network. In addition, in the inverter-based volt-VAR control problem, [16] develops a highly efficient adversarial RL algorithm to train an offline agent that is robust to model mismatches during the offline stage.

III-I Benchmarks

Benchmarks include both benchmark environments and benchmark algorithms. Safety Gym, developed by OpenAI, is the first widely recognized safe benchmark environment. It includes an environment-builder and a suite of pre-configured benchmark environments [21, 67]. Correspondingly, Safety Starter Agents, a benchmark algorithm library, has been developed based on Safety Gym [68]. The supported algorithms in this library include PPO, PPO-Lag, TRPO, TRPO-Lag, SAC, SAC-Lag, and CPO. This package has been tested on Mac OS Mojave and Ubuntu 16.04 LTS and is likely compatible with most recent Mac and Linux operating systems.

Safety Gymnasium, an update and extension of Safety Gym, has become the mainstream platform [69, 70]. Correspondingly, a benchmark repository for safe RL algorithms has been proposed, named SafePO [71]. SafePO is tested on the Linux platform and potentially supports Mac or Windows, requiring only modifications to the Linux path and sort functions for compatibility.

SafePO further extends the variety of supported safe RL algorithms, as illustrated in Fig. 7.

OmniSafe emerges as the first unified learning framework in the field of safe RL, featuring a highly modular framework that includes a comprehensive collection of algorithms specifically developed for safe RL across various domains. Its versatility comes from an abstracted algorithm structure and a well-designed API, facilitating seamless integration of different components, thereby simplifying extension and customization for developers. Additionally, OmniSafe enhances algorithm learning speeds through process parallelism, supporting both environment-level and agent asynchronous parallel learning. OmniSafe is supported and tested on Linux and also supports M1 and M2 versions of macOS. However, it does not support Windows [72, 73]. The supported safe RL algorithms of OmniSafe are shown in Table I.

TABLE I: Supported Safe RL Algorithms of OmniSafe

Domains	Types	Algorithms Registry
On Policy	Primal-Dual	TRPO-Lag; PPO-Lag; PDO; RCPO
	Convex Optimization	CPO; PCPO; FOCOPS; CUP
	Penalty Function	IPO; P3O
	Primal	OnCRPO
Off Policy	Primal-Dual	DDPG-Lag; TD3-Lag; SAC-Lag
		DDPG-PID; TD3-PID; SAC-PID
Model-based	Online Plan	SafeLOOP; CCEPETS; RCEPETS
	Pessimistic Estimate	CAPPETS
Offline	Q-Learning-Based	BCQ-Lag; C-CRR
	DICE-Based	COptDICE
Other MDP	ET-MDP	PPO/TRPO-EarlyTerminated
	SauteRL	PPOSaute; TRPOSaute
	SimmerRL	PPOSimmer-PID; TRPOSimmer-PID

Overall, Safety Gymnasium is the current mainstream benchmark environment, and OmniSafe has also integrated Safety Gymnasium to ensure overall code compatibility. It is important to note that Safety Gymnasium was primarily developed for control in gaming, robotics, autonomous driving, etc., featuring a series of agents such as point, car, dog, and ant, among others. It offers several specific environments tailored for challenges such as safe navigation, safe velocity, and safe vision, but it is not directly applicable to power system problem formulations. Hence, there is a need to develop corresponding power system control environments based on the environment templates provided by Safety Gymnasium. In terms of benchmark algorithms, OmniSafe offers a more comprehensive set of algorithms but currently does not support Windows due to difficulties with Python library installations. In contrast, SafePO is more easily extended to Windows. Since most professional power system software is developed for Windows, with less support for Linux and macOS, this may limit the application of OmniSafe in model-based environments. However, if surrogate models are used to substitute for physical models in a model-free environment, OmniSafe can be utilized in Linux or macOS.

IV Power System Applications of Safe RL

This review synthesizes a broad collection of studies and applications of safe RL in power systems, covering a wide array of domains: optimal power generation dispatch, voltage control, stability control, EV charging control, building energy management, electricity market, system restoration, and unit commitment and reserve scheduling. Safe RL algorithms used in various application domains are presented in Fig. 1. As depicted in Fig. 8, RL-based schemes collect power system measurements, including PMU and AMI readings, and integrate system model knowledge into their policy training. They take action to control power system devices, ensuring safety requirements like feasibility, stability, and robustness are met. For each domain, the research problem or objective function, constraint, constraint type (cumulative/instantaneous and hard/soft), applied safe constraint techniques, and key features are reviewed to compare different studies that use safe RL.

IV-A Optimal Power Generation Dispatch

TABLE II: Safe RL Applications in Optimal Power Generation Dispatch

Research	Problem/Objective	Constraint	Constraint Type	Safety Constraint Techniques	Key Features
[27]	Minimize the total generation cost	Physical operation constraints	Cum/Soft	Primal-dual method (III-A)	Combines the primal-dual DDPG with the classic SCOPF model. The actor gradients are approximated by solving the Karush-Kuhn-Tucker conditions of the Lagrangian.
[24]	Minimize the fuel costs and power loss from BESSs	Physical constraints	Ins/Hard	Projection (III-B) and primal-dual method (III-A)	A primal-dual approach is introduced to learn optimal constrained DRL policies specifically for predictive control in real-time stochastic dynamic OPF.
[74]	Minimize the total system cost	Physical constraints	Cum/Hard	Safety layer (III-F)	Unsafe actions are projected into the safe action space while constrained zonotope set is used to improve efficiency.
[75]	Minimize the cost of thermal power MESS	Power grid and MESSs constraints	Ins/Hard	Proximal gradient projection (III-B)	MESSs are modeled as CMDP, and a framework is proposed based on a DRL algorithm that considered the discrete-continuous hybrid action space of the MESSs.
[15]	Minimize the total energy cost	Power system constraints	Cum/Hard	Lagrange relaxation (III-A) and logarithmic barrier (III-G)	Function approximation addresses large, continuous state spaces, while a diffusion strategy coordinates actions of DG units and ESSs.
[76]	Minimize the generator fuel cost	Power system constraints	Ins/Hard	Safety layer (III-F)	The proposed method uses physics-driven parameters for easy modification and less conservative, easily re-parameterizable actions.
[77]	Minimize the operating cost	Power system constraints	Ins/Hard	Safety layer (III-F)	To avoid line overload, a safety layer is added by introducing transmission constraints to avoid dangerous actions and tackle sequential security-constrained OPF problem.
[10]	Minimize the total operating cost	Physical constraints of system and devices	Cum/Hard	CPO (III-B)	To optimize both discrete and continuous actions, a stochastic policy based on a joint distribution of mixed random variables is designed and learned through a NN approximator.
[11]	Minimize the total cost of operation of microgrids	Global and local constraints	Cum/Soft	Lagrangian relaxation (III-A) and projection (III-B)	The training process employs the gradient information of operational constraints to ensure that the optimal control policy functions generate safe and feasible decisions.
[78]	Minimize the operational cost	Operation and power balance constraints	Cum/Hard	CPO (III-B) and invalid action masking (III-E)	Invalid action masking is applied to avoid invalid actions, accomplished by replacing the logits of the actions to be masked with a large negative number.
[79]	Minimize the total operational cost	AC-PF constraints	Cum/Hard	CPO (III-B)	Contrary to traditional DRL methods, the proposed method constrains exploration to only those policies that comply with AC-PF constraints.
[28]	Minimize the total operational cost	Gas system and power system constraints	Cum/Soft	Lagrangian relaxation (III-A)	The penalty is adaptively updated based on the extent of constraint violation, facilitating the prediction of near-optimal control actions that achieve near-zero degrees of violation.
[80]	Minimize the operating cost for the whole horizon	Operational constraints	Ins/Hard	MIP formulation	The action-value function, approximated through a DNN, is structured as a MIP formulation, enabling the inclusion of constraints within the action space.
[81]	Optimize the total generation cost	Operational and linguistic stipulation constraints	N.A./Soft	Primal-dual method (III-A)	For the first time, a GPT LLM is integrated into the OPF framework alongside linguistic rules. This novel approach models and quantifies natural language stipulations as objectives and constraints within a primal-dual DRL loop.
[82]	Minimize the total operation cost	Operational constraints	N.A./Soft	Lagrangian relaxation (III-A)	Instead of using the critic network, the deterministic gradient is derived analytically and solved by using interior point method.
[83]	Minimize the total energy cost	Satisfaction of the energy demand	Cum/Soft	Lagrangian relaxation (III-A) and RRL (III-H)	This approach efficiently uses short-horizon forecasts to prevent energy demand failures and reduce costs, surpassing the capabilities of standard safe RL methods.
[12]	Minimize the costs of DGs production and RES curtailment	Constraints of distribution network	Cum/Hard	IPO (III-G)	The generalization of IPO is improved by extracting spatial-temporal features from microgrid operation data, leveraging the advantages of edge-conditioned convolutional networks and long short-term memory networks.
[84]	Multi-energy management	Thermal energy balance	Cum/Hard	Shielding method (III-E)	Decoupling architecture of safety constraint formulations from the RL formulation. Hard-constraint satisfaction without the need to solve a mathematical program.
[85]	Minimize the cost of electricity net, DG and gas	Constraints of the power and gas networks	Ins/Hard	Safety layer (III-F)	By learning a dynamic security assessment rule, a physically-informed safety layer ensures adherence to physical constraints by solving an action correction formulation.
[14]	Minimize the overall operation cost	Branch power flow security constraint	Ins/Soft	Lagrangian relaxation (III-A) and RRL (III-H)	An adversarial safe RL approach is proposed to enhance action safety and robustness against deviations between training and testing environments.
[51]	Minimize the operation cost	Operational constraints	Ins/Hard	Safety layer (III-F), projection (III-B), and RRL (III-H)	A safety layer that blends knowledge and data-driven approaches is created. Also, security constraints and linear projection are combined to improve computational speed.

Cum: Cumulative; Ins: Instantaneous; N.A.: Not applicable or not available.

Optimal power generation dispatch considers various constraints, ranging from simplified versions to security constraints, and includes economic dispatch, DC-OPF, AC-OPF, and SCOPF. The operation of a power system must meet both security and economic requirements. Considering credible contingencies, AC-OPF has been widely used [79, 86]. Most existing methods for solving OPF rely on analytical methods; however, given the inherently large scale of these problems, real-time computation is very challenging. A more recent variation of OPF is the SCOPF, which requires significantly longer computation times due to the additional security constraints [27]. To accelerate the calculation of SCOPF, methods such as DC-PF approximation [87], convex power flow approximation [88], and convex security constraint approximation [89] have been proposed. However, the accuracy of these methods has been questioned, and they remain time-consuming for large-scale systems. To accelerate computation and achieve better solutions, RL methods have been widely applied. Since traditional RL struggles to handle safety constraints effectively, safe RL has been further applied to address these issues.

The details of the applications of safe RL in optimal power generation dispatch are shown in Table II. Based on Table II, we summarize the foundational framework for implementing safe RL in optimal power generation dispatch using a specific example with SGs, RESs, and BESSs, incorporating strict physics-based constraints such as AC- and DC-PF constraints. If the system encompasses additional power system devices, the presented equations are designed to be readily scalable to accommodate them. Note that the models presented below are examples for illustration, and there are other RL formulations and models for optimal power generation dispatch depending on the specific problem setting. This is also true for other application domains. The state, action, reward, and constraints of optimal power generation dispatch are shown as follows.

IV-A1 AC-PF

AC-PF constraints describe the basic physics of power systems and have been widely considered in optimal power generation dispatch, voltage control, unit commitment, etc.

State

The states include active and reactive loads and voltage:

\bm{s}^{\text{AC}}_{t}\triangleq\left(\bm{v}_{t},\bm{p}^{\text{Load}}_{t},\bm{q}^{\text{Load}}_{t}\right)

(12)

Action

The control actions encompass both active and reactive power generation of SGs, active power generation of RESs, alongside power charging or discharging of BESSs:

\bm{a}^{\text{AC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{q}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right)

(13)

Reward

The reward includes the SG generation cost, the RES curtailment cost, and the BESS cost:


$\displaystyle\max_{\pi_{\theta}\in\Pi_{S}}$	$\displaystyle\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]$	(14a)
$\displaystyle R^{\text{AC}}(\bm{s},\bm{a})$	$\displaystyle=-\left\|\sum_{\forall i\in\mathcal{G}}\left(a^{\text{SG}}_{i}(p^{\text{SG}}_{i,t})^{2}+b^{\text{SG}}_{i}p^{\text{SG}}_{i,t}+c^{\text{SG}}_{i}\right)\right\|$
	$\displaystyle\quad-\sum_{\forall i\in\mathcal{R}}c^{\text{RES}}_{i}\left\|p^{\text{RES}}_{\text{MPPT},i,t}-p^{\text{RES}}_{i,t}\right\|$
	$\displaystyle\quad-\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{dis},i}p^{\text{BESS}}_{\text{dis},i,t}+\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{ch},i}p^{\text{BESS}}_{\text{ch},i,t}$	(14b)
$\displaystyle\bm{s}^{\text{AC}}_{t}$	$\displaystyle=f_{t}(\bm{s}^{\text{AC}}_{t-1},\bm{a}^{\text{AC}}_{t-1}),\quad\bm{a}^{\text{AC}}_{t}\sim\pi(\bm{a}^{\text{AC}}_{t}\mid\bm{s}^{\text{AC}}_{t-1})$	(14c)
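To make the single-step reward in (14b) concrete, the following sketch evaluates it term by term for scalar unit-level quantities. The function name `dispatch_reward`, the argument order, and the numerical values in the usage note are illustrative assumptions rather than part of any surveyed method.

```python
def dispatch_reward(p_sg, sg_coeffs, p_res, p_mppt, c_res, p_dis, c_dis, p_ch, c_ch):
    """Single-step reward following the structure of (14b).

    p_sg:      SG active power outputs
    sg_coeffs: per-SG tuples (a, b, c) of the quadratic cost a*p^2 + b*p + c
    p_res:     dispatched RES outputs; p_mppt: available (MPPT) RES outputs
    c_res:     per-unit RES curtailment costs
    p_dis, p_ch: BESS discharging / charging powers with costs c_dis, c_ch
    """
    generation_cost = sum(a * p ** 2 + b * p + c
                          for p, (a, b, c) in zip(p_sg, sg_coeffs))
    curtailment_cost = sum(c * abs(pm - p)
                           for c, pm, p in zip(c_res, p_mppt, p_res))
    discharge_term = sum(c * p for c, p in zip(c_dis, p_dis))
    charge_term = sum(c * p for c, p in zip(c_ch, p_ch))
    return -abs(generation_cost) - curtailment_cost - discharge_term + charge_term
```

With two SGs producing 1.0 and 2.0 (cost coefficients (0.1, 2.0, 0.0) and (0.2, 1.0, 0.0)), one RES unit curtailed from 1.5 to 1.0 at a unit cost of 3.0, and one BESS discharging 0.5 at a unit cost of 0.2, the three cost terms are 4.9, 1.5, and 0.1, giving a reward of approximately -6.5.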

Constraint

The control actions derived from DRL must adhere to hard physical constraints. AC-PF constraints include bus active and reactive power balance constraints, SG active and reactive power generation constraints, RES active power generation constraints, voltage constraints, and branch apparent power constraints:


	$\displaystyle\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+$
	$\displaystyle\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t}=\Re\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\}$		(15a)
	$\displaystyle\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}-\bm{q}^{\text{Load}}_{t}=\Im\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\}$		(15b)
	$\displaystyle\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}}\quad\underline{\bm{q}}^{\text{SG}}\leq\bm{q}^{\text{SG}}_{t}\leq\overline{\bm{q}}^{\text{SG}}$		(15c)
	$\displaystyle\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}}\quad\underline{\bm{v}}\leq\|\bm{v}\|\leq\overline{\bm{v}}\quad\|s_{ij}\|\leq\overline{s}_{ij}$		(15d)

where $\mathbf{M}^{\text{SG}}\in\{0,1\}^{N\times G}$ denotes the matrix that maps the generation vector $\bm{p}_{t}^{\text{SG}}\in\mathbb{R}^{|\mathcal{G}|}$ to $\mathbb{R}^{N}$:


	$\displaystyle[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=0\quad[\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=0,\quad\forall i\in\mathcal{N}\setminus\mathcal{G}$		(16a)
	$\displaystyle[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=p^{\text{SG}}_{j}\quad[\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=q^{\text{SG}}_{j},\quad\forall i\in\mathcal{G},\forall j\in[G]$		(16b)
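The role of the mapping matrices in (15) and (16) can be illustrated with a small sketch that builds a 0/1 matrix from a list of unit locations and maps unit-level outputs to bus-level injections. The helper names `incidence_matrix` and `map_to_buses` and the zero-based bus indexing are assumptions made for illustration.

```python
def incidence_matrix(n_buses, unit_buses):
    """Return M in {0,1}^(N x G) with M[i][j] = 1 iff unit j sits at bus i.

    unit_buses[j] is the (zero-based) bus index of unit j.
    """
    m = [[0] * len(unit_buses) for _ in range(n_buses)]
    for j, bus in enumerate(unit_buses):
        m[bus][j] = 1
    return m


def map_to_buses(m, p_units):
    """Compute M @ p: bus-level injections from unit-level outputs."""
    return [sum(mij * pj for mij, pj in zip(row, p_units)) for row in m]
```

For a four-bus system with units at buses 0 and 2, `map_to_buses(incidence_matrix(4, [0, 2]), [1.5, 0.7])` places 1.5 at bus 0 and 0.7 at bus 2 and leaves the buses without units at zero, which mirrors (16a) and (16b).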

IV-A2 DC-PF

DC-PF constraints represent a linear approximation of AC-PF and are commonly included in optimal power generation dispatch and electricity market considerations.

State

Voltage magnitudes and reactive power are neglected in DC-PF, so the state consists of the voltage phase angles $\bm{\vartheta}_{t}$ and the active loads:

\bm{s}^{\text{DC}}_{t}\triangleq\left(\bm{\vartheta}_{t},\bm{p}^{\text{Load}}_{t}\right)

(17)

Action

The action involves only the generation or consumption of active power.

\bm{a}^{\text{DC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right)

(18)

Reward

The reward is similar to that of the AC-PF case (14).

Constraint

The DC-PF constraints are a simplification of the AC-PF constraints, retaining only the active power components and disregarding voltage issues [90].


	$\displaystyle\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+$
	$\displaystyle\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t}=\mathbf{B}\bm{\vartheta}_{t}$		(19a)
	$\displaystyle\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}}\quad\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}}$		(19b)
	$\displaystyle\|p_{ij}\|\leq\overline{p}_{ij}$		(19c)
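A feasibility check for the power balance (19a) and the line limits (19c) can be sketched as follows. The sketch assumes that the net bus injections (left-hand side of (19a)) have already been assembled, that each line flow is computed as the line susceptance times the angle difference, and that `B` is the corresponding bus susceptance matrix; all function names and the three-bus data in the usage note are hypothetical.

```python
def dc_injections(B, theta):
    """Right-hand side of (19a): B @ theta."""
    return [sum(b * th for b, th in zip(row, theta)) for row in B]


def dc_line_flows(lines, theta):
    """lines: list of (i, j, susceptance); returns flow on each line i -> j."""
    return {(i, j): b * (theta[i] - theta[j]) for i, j, b in lines}


def dc_feasible(net_injection, B, theta, lines, limits, tol=1e-6):
    """Check the power balance (19a) and the line limits (19c)."""
    balance_ok = all(abs(p - q) <= tol
                     for p, q in zip(net_injection, dc_injections(B, theta)))
    flows = dc_line_flows(lines, theta)
    flow_ok = all(abs(flows[k]) <= limits[k] + tol for k in flows)
    return balance_ok and flow_ok
```

As an example, consider a three-bus ring with line susceptances of 10 on lines (0, 1), (1, 2), and (0, 2), and angles (0, -0.01, -0.02). The net injections are approximately (0.3, 0, -0.3) and the line flows are 0.1, 0.1, and 0.2, so the operating point is feasible only if the limit of line (0, 2) is at least 0.2.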

IV-A3 BESS Constraints

The BESS constraints include charging and discharging constraints, and SoC constraints.


	$\displaystyle 0\leq\bm{p}^{\text{BESS}}_{\text{ch},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{ch}}\quad 0\leq\bm{p}^{\text{BESS}}_{\text{dis},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{dis}}$		(20a)
	$\displaystyle\underline{\bm{SoC}}^{\text{BESS}}\leq\bm{SoC}^{\text{BESS}}_{t}\leq\overline{\bm{SoC}}^{\text{BESS}}$		(20b)
	$\displaystyle\bm{SoC}^{\text{BESS}}_{t}=\bm{SoC}^{\text{BESS}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{BESS}}}\Big(\eta^{\text{BESS}}_{\text{ch}}\bm{p}^{\text{BESS}}_{\text{ch},t}-\frac{\bm{p}^{\text{BESS}}_{\text{dis},t}}{\eta^{\text{BESS}}_{\text{dis}}}\Big)$		(20c)
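For a single battery, the constraints (20a) to (20c) can be checked before an action is applied, for example inside a shield or a safety layer. The sketch below is a minimal illustration; the function names and the parameter values in the usage note are assumptions.

```python
def soc_step(soc, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis):
    """State-of-charge update (20c) for one battery."""
    return soc + dt / e_cap * (eta_ch * p_ch - p_dis / eta_dis)


def bess_action_is_safe(soc, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis,
                        p_ch_max, p_dis_max, soc_min, soc_max):
    """Check the power limits (20a) and the resulting SoC against (20b)."""
    if not (0.0 <= p_ch <= p_ch_max and 0.0 <= p_dis <= p_dis_max):
        return False
    next_soc = soc_step(soc, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis)
    return soc_min <= next_soc <= soc_max
```

With a capacity of 4, a step of 1, and efficiencies of 0.9, charging at 1.0 from an SoC of 0.5 leads to approximately 0.725, which is admissible for SoC limits of 0.1 and 0.9, whereas discharging at 1.0 from an SoC of 0.25 would drive the SoC below the lower limit and is therefore rejected.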

IV-B Voltage Control

TABLE III: Safe RL Applications in Voltage Control

Research	Problem/Objective	Constraint	Constraint Type (Cum/Ins, Hard/Soft)	Safety Constraint Techniques	Key Features
[33]	Minimize transmission losses	Voltage and other system constraints	Ins/Hard	Projection layer (III-B)	Through an embedded safe policy projection, it is possible to smoothly and effectively limit the action space, thereby preventing any breach of physical constraints.
[40]	Minimize cost	Voltage constraint	Ins/Hard	Lyapunov stability (III-C)	Ensuring that each NN controller satisfies certain Lipschitz constraints to inherently meet these constraints, thus guaranteeing the system maintains exponential stability.
[91]	Minimize transmission loss	Voltage and power flow constraints	Ins/Hard	Finite iteration projection (III-B)	A finite iteration projection algorithm is proposed to guarantee hard constraints by converting a non-convex optimization problem into a finite iteration problem.
[52]	Minimize the cost of network loss and device switching	Voltage and power flow constraints	Cum/Hard	Safety layer (III-F)	A safety layer is added to the policy NN to enhance operational constraint satisfaction for both initial exploration phase and convergence phase.
[17]	Minimize total network energy loss	Voltage deviations	Cum/Soft	Primal-dual policy (III-A)	Each zone has a central control agent that embeds GCNs to improve the decision-making capability. The primal-dual method is used to rigorously satisfy voltage safety constraints.
[92]	Minimize active power loss	Voltage violations	Cum/Soft	Lagrangian relaxation (III-A)	A MACSAC RL algorithm is proposed, which is utilized to train control agents online, eliminating the need for accurate ADN models.
[47]	Active voltage control	SoC of BESSs	Ins/Hard	Physics-based shielding (III-E)	The physics-shielded MATD3 algorithm is proposed, capable of replacing dangerous actions with safe ones as the BESSs approach dangerous SoC.
[93]	Minimize the ADN power losses and control efforts	Voltage and power grid constraints	Ins/Hard	Safety layer (III-F)	A safety layer is directly integrated on top of the DDPG actor network to forecast changes in constrained states and prevent the violation of operational constraints in ADNs.
[94]	Minimize the network power loss	Nodal voltage constraint	Ins/Hard	Safety projection (III-B)	In the training stage, the safety projection is added to the combined policy to analytically solve an action correction formulation to achieve guaranteed 100% voltage security.
[25]	Minimize the cost of losses and the device switching	Voltage constraint	Ins/Soft	Lagrangian relaxation (III-A)	A safe off-policy DRL, Constrained SAC, is proposed to solve Volt-VAR control problems in a model-free manner.
[95]	Minimize the total control cost	Voltage constraint	Ins/Hard	Safety projection layer (III-B)	By leveraging the underlying grid information, a projection layer is designed to project the reactive power injection into a safe set of nodal voltage magnitudes.
[41]	Minimize the voltage deviation and control cost	Voltage constraint	Ins/Hard	Lyapunov function (III-C)	An explicitly constructed Lyapunov function is utilized to certify stability for all monotone policies without knowledge of the underlying model parameters.
[96]	Minimize the cost of electricity and BESSs maintenance	Voltage constraint and ADN constraints	Cum/Soft	SAC with safety module	A model-free DRL algorithm, integrated with a safety module, is proposed to minimize voltage violations and real power losses, with a design that guarantees no voltage violations occur during the online training.
[35]	Minimize the total operation costs	Physical constraints	Cum/Hard	CPO (III-B)	The voltage control problem is formulated as a CMDP and solved by TRPO and CPO to enable safe exploration.
[16]	Minimize voltage violations and network losses	Voltage bound constraints	Cum/Soft	Penalty function and RRL (III-H)	An adversarial RL algorithm has been developed to train an offline agent that is robust against model mismatches.

Voltage control is designed to ensure the magnitudes of voltage across power networks remain close to nominal values or within an acceptable range. For example, Fig. 9 shows the Volt/Var/Watt curves of voltage control [97]. Instead of directly controlling the active and reactive power injections of smart inverters, some researchers have proposed resetting the Volt/Var/Watt curves to control the voltage profiles [98, 99]. Increasing penetration levels of RESs, such as the large-scale deployment of wind farms in transmission systems and the widespread installation of distributed PVs and EVs in distribution networks, have led to significant changes in power system behavior. Because distribution networks are typically radial or distributed in structure and connect a large number of intermittent and uncertain distributed RESs, voltage management has become more complex and challenging, often leading to voltage violations (either below 0.95 p.u. or above 1.05 p.u.) [100, 101]. Many current studies on voltage regulation utilize a physical model-based optimization/control method, employing convex relaxation techniques like second-order cone programming to simplify AC-PF constraints. This approach allows for efficient resolution using conventional solvers [33, 25, 102]. The application of safe RL in the area of voltage control is detailed in Table III. According to Table III, we take the smart inverters of DGs and BESSs as a prime example to summarize the voltage control problem associated with safe RL. The state, action, reward, and constraints of voltage control are shown as follows:

IV-B1 Volt/Var Control with AC-PF Constraints

State

The state variables are represented by PMU measurements, with sensors installed at buses denoted by $\mathcal{N}^{\text{PMU}}$ , or AMI measurements, with sensors installed at buses denoted by $\mathcal{N}^{\text{AMI}}$ . Thus, the state variable $\bm{s}$ is comprehensively defined by:


$\displaystyle\bm{s}^{\text{PMU}}$	$\displaystyle\triangleq\left((v_{i})_{i\in\mathcal{N}^{\text{PMU}}},(i_{i})_{i\in\mathcal{N}^{\text{PMU}}}\right)$	(21a)
$\displaystyle\bm{s}^{\text{AMI}}$	$\displaystyle\triangleq\left((\|v_{i}\|^{2})_{i\in\mathcal{N}^{\text{AMI}}},(\|i_{i}\|^{2})_{i\in\mathcal{N}^{\text{AMI}}},(s_{ap,i})_{i\in\mathcal{N}^{\text{AMI}}}\right)$	(21b)

The system dynamics that depict the environment can be formulated as

\bm{s}^{\text{V}}_{t+1}\triangleq\bm{f}(\bm{s}^{\text{V}}_{t},\bm{a}^{\text{V}}_{t})

(22)

Action

The control actions include regulating the DGs, BESSs, and other components.

\bm{a}^{\text{V}}_{t}\triangleq\left(\bm{p}^{\text{DG}}_{t},\bm{q}^{\text{DG}}_{t},\bm{p}^{\text{BESS}}_{t},\bm{p}^{\text{other}}_{t}\right)

(23)

Reward

The reward is to maintain the voltage magnitudes close to the nominal value $v_{\text{ref}}$ (typically 1.0 p.u.):

R^{\text{V}}(\bm{s},\bm{a})=-\|{\bm{v}_{t}-v_{\text{ref}}}\|

(24)

Another kind of reward design is a soft mechanism based on an acceptable range:

R^{\text{V}}(\bm{s},\bm{a})=-\sum_{i\in\mathcal{N}}\big([v_{i}-\overline{v}]_{+}+[\underline{v}-v_{i}]_{+}\big)

(25)
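The two reward designs (24) and (25) can be compared with a short sketch. It assumes that the norm in (24) is the Euclidean norm and uses the 0.95 p.u. and 1.05 p.u. limits mentioned earlier as default bounds; the function names are illustrative.

```python
import math


def tracking_reward(v, v_ref=1.0):
    """Reward (24): negative Euclidean distance of the voltage profile to v_ref."""
    return -math.sqrt(sum((vi - v_ref) ** 2 for vi in v))


def band_reward(v, v_min=0.95, v_max=1.05):
    """Reward (25): negative sum of violations of the band [v_min, v_max]."""
    return -sum(max(vi - v_max, 0.0) + max(v_min - vi, 0.0) for vi in v)
```

A profile such as (1.0, 1.02, 0.97) receives a zero reward under (25) because every bus lies inside the band, but a strictly negative reward under (24). The tracking reward therefore keeps pushing voltages towards the nominal value, whereas the band reward is indifferent inside the acceptable range.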

Constraint

The constraint for the active and reactive power injections of DGs is given by:

(\bm{p}^{\text{DG}})^{2}+(\bm{q}^{\text{DG}})^{2}\leq(\bar{\bm{s}}_{\text{ap}}^{\text{DG}})^{2}

(26)
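For the circular capability limit in (26), one simple way to enforce the constraint on a proposed inverter set-point is a Euclidean projection onto the apparent power disk, which reduces to a radial scaling. The sketch below shows this for a single DG; the function name `project_to_capacity` is hypothetical, and the projection is only one possible correction rule, not necessarily the one used in the surveyed works.

```python
import math


def project_to_capacity(p, q, s_max):
    """Project (p, q) onto the disk p^2 + q^2 <= s_max^2 from (26).

    Points inside the disk are returned unchanged; points outside are
    scaled radially onto the boundary, which is the Euclidean projection.
    """
    magnitude = math.hypot(p, q)
    if magnitude <= s_max:
        return p, q
    scale = s_max / magnitude
    return p * scale, q * scale
```

For example, a set-point of (3.0, 4.0) with a rating of 2.5 is scaled to (1.5, 2.0), which preserves the power factor of the original request.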

However, [97] points out that the stability regions are more constrained than in Equation (26). For simplicity, we omit the specific equations. Figure 9 illustrates the piece-wise linear equations that constrain the battery system’s active and reactive power injections within the blue feasible region, while the solar panel inverters are only in the right region, as they do not have a discharging process, i.e., $p\geq 0$ .

IV-B2 Volt/Var Control with LinDistFlow Constraints

The LinDistFlow linearized branch flow model is applied within a tree-structured distribution network. The system consists of a set of nodes $\mathcal{N}_{+0}=\{0,1,\cdots,N\}$ and an edge set $\mathcal{E}$ . Node 0 is known as the substation, and $\mathcal{N}=\mathcal{N}_{+0}\setminus\{0\}$ denotes the set of nodes excluding the substation node. Each node $i\in\mathcal{N}$ is associated with an active power injection $p_{i}$ and a reactive power injection $q_{i}$ . Let $v_{i}$ be the squared voltage magnitude, and let $\bm{p},\bm{q}$ and $\bm{v}$ denote $\{p_{i},q_{i},v_{i}\}_{i\in\mathcal{N}}$ stacked into vectors. The variables satisfy the following equations, $\forall i\in\mathcal{N}$ ,


$\displaystyle p_{i}$	$\displaystyle=-p_{ji}+\sum_{k:(i,k)\in\mathcal{E}}p_{ik}$	(27a)
$\displaystyle q_{i}$	$\displaystyle=-q_{ji}+\sum_{k:(i,k)\in\mathcal{E}}q_{ik}$	(27b)
$\displaystyle v_{i}$	$\displaystyle=v_{j}-2(r_{ij}p_{ji}+x_{ij}q_{ji})$	(27c)

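where $j$ is the parent node of $i$ in the distribution network, $p_{ji}$ and $q_{ji}$ denote the active and reactive power flows from node $j$ to node $i$ , and $r_{ij}$ and $x_{ij}$ are the resistance and reactance of the connecting line. (27c) can be written in the vector form: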

\bm{v}=\mathbf{r}\bm{p}+\mathbf{x}\bm{q}+v_{0}\mathbf{1}=\mathbf{x}\bm{q}+\bm{v}_{\text{env}}

(28)

where $\bm{v}_{\text{env}}=\mathbf{r}\bm{p}+v_{0}\mathbf{1}$ represents the component that cannot be controlled; $\mathbf{r}=[2r_{ij}]^{N\times N}$ and $\mathbf{x}=[2x_{ij}]^{N\times N}$ are matrices determined by the line parameters $r_{ij}$ and $x_{ij}$ , respectively.
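A small numerical sketch may help connect the recursion (27) to the vector form (28). It assumes the standard LinDistFlow construction, in which entry $(i,k)$ of $\mathbf{r}$ (or $\mathbf{x}$) is twice the sum of the resistances (or reactances) of the lines shared by the paths from nodes $i$ and $k$ to the substation; the text above does not spell this construction out, so it is stated here as an assumption. The four-node feeder, its parameters, and all function names are hypothetical.

```python
def path_edges(node, parent):
    """Lines on the path from `node` to substation 0, each named by its child node."""
    edges = set()
    while node != 0:
        edges.add(node)
        node = parent[node]
    return edges


def sensitivity(parent, z):
    """Entry (i, k) is 2 * (sum of z over lines shared by the paths of i and k)."""
    nodes = sorted(parent)
    paths = {i: path_edges(i, parent) for i in nodes}
    return [[2.0 * sum(z[e] for e in paths[i] & paths[k]) for k in nodes]
            for i in nodes]


def matvec(m, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in m]


def lindistflow_voltage(parent, r, x, p, q, v0=1.0):
    """Eq. (28): v = r p + x q + v0 1, split into x q + v_env."""
    R, X = sensitivity(parent, r), sensitivity(parent, x)
    v_env = [rp + v0 for rp in matvec(R, p)]  # uncontrollable component
    return [xq + ve for xq, ve in zip(matvec(X, q), v_env)]


parent = {1: 0, 2: 1, 3: 1}          # hypothetical feeder: 0-1, 1-2, 1-3
r = {1: 0.01, 2: 0.02, 3: 0.015}     # resistance of line (parent[i], i), p.u.
x = {1: 0.02, 2: 0.03, 3: 0.02}      # reactance of line (parent[i], i), p.u.
p = [-0.5, -0.3, 0.4]                # active injections at nodes 1..3
q = [-0.1, -0.1, 0.2]                # reactive injections at nodes 1..3
v = lindistflow_voltage(parent, r, x, p, q)
print(v)
```

For this example, the result agrees with applying (27c) line by line from the substation outward, which is a useful sanity check when building the matrices for a larger feeder.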

State

The state of LinDistFlow is also determined by PMU and AMI measurements, similar to the AC-PF (21).

Action

The control action is a mapping from the voltage to the reactive power adjustment, which is defined by:

\bm{a}^{\text{V}}_{t}=\Delta\bm{q}_{t}\triangleq\bm{q}_{t}-\bm{q}_{t+1}

(29)

The system dynamics can be given as

\bm{v}_{t+1}=\mathbf{r}\bm{p}+\mathbf{x}(\bm{q}_{t}-\bm{a}^{\text{V}}_{t})+v_{0}\mathbf{1}

(30)

where $\bm{p}$ carries no time subscript because the voltage controller acts on a fast time scale, over which the active power injection is assumed to be constant.

Reward

The reward is also designed to keep the voltage close to its nominal value (24) or within its maximum and minimum limits (25).

Constraint

The constraints include maximum and minimum value limits and the stability of the action:


	$\displaystyle~{}\underline{\bm{a}}^{\text{V}}\leq\bm{a}^{\text{V}}_{t}\leq\overline{\bm{a}}^{\text{V}}$		(31a)
	$\displaystyle\bm{a}^{\text{V}}_{t}~{}\text{is stabilizing}$		(31b)
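To illustrate what the bound (31a) and the stabilizing requirement (31b) mean in the simplest setting, the sketch below simulates (28)-(30) for a single bus with a linear controller $a_{t}=k(v_{t}-v_{\text{ref}})$. The linear controller, the single-bus reduction, and all parameter values are assumptions made for illustration. In this scalar case the voltage error evolves as $e_{t+1}=(1-xk)e_{t}$ when the action is not saturated, so the controller is stabilizing only if $0<xk<2$.

```python
def clip(a, lo, hi):
    return max(lo, min(hi, a))


def simulate(v_env, x, q0, gain, a_min, a_max, steps=50, v_ref=1.0):
    """Single-bus rollout of Eqs. (28)-(30) with a clipped linear controller."""
    q = q0
    history = []
    for _ in range(steps):
        v = v_env + x * q                            # Eq. (28), scalar case
        history.append(v)
        a = clip(gain * (v - v_ref), a_min, a_max)   # bound (31a)
        q = q - a                                    # Eq. (29): a_t = q_t - q_{t+1}
    return history


stable = simulate(v_env=1.06, x=0.1, q0=0.0, gain=5.0, a_min=-0.2, a_max=0.2)
unstable = simulate(v_env=1.06, x=0.1, q0=0.0, gain=25.0, a_min=-10.0, a_max=10.0)
print(stable[-1], unstable[-1])
```

With `gain=5.0` the product $xk$ is 0.5 and the voltage settles at the reference, while `gain=25.0` gives $xk=2.5$ and a growing oscillation, even though both controllers respect their action bounds.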

IV-B3 Safe RL for Voltage Control

In recent years, the integration of DERs such as rooftop solar panels and EVs has led to rapid and unpredictable fluctuations in the generation and load profiles of distribution systems. These fluctuations pose significant challenges in real-time voltage control for distribution grids. RL has emerged as a powerful approach for addressing model-free nonlinear control problems, generating considerable interest in developing RL-based controllers to optimize the transient performance of voltage control problems. Safe RL has been effectively implemented to ensure adherence to voltage and transient stability constraints.

Looking ahead, the focus is shifting toward distributed voltage regulation, driven by the limitations of centralized voltage regulation, which requires a central controller and is susceptible to single-point failures and significant communication burdens. Distributed voltage regulation, which only requires the exchange of local information with neighboring units, has therefore attracted considerable research interest as a promising direction for future development [17].

IV-C Stability Control

TABLE IV: Safe RL Applications in Stability Control

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
| --- | --- | --- | --- | --- | --- |
| [56] | Preventive control for transmission overload relief | Safety, generation, and network constraints | Cum/Hard | IPO (III-G) | The IPO method’s efficacy is boosted by leveraging spatial-temporal correlations in power grid nodal and edge features. |
| [57] | Emergency control for under voltage load-shedding | Transient voltage stability | Cum/Hard | Barrier function (III-G) | The safe RL method employs a reward function with a time-dependent barrier function that approaches negative infinity as the system state nears the safety bounds. |
| [103] | Emergency load-shedding control | Rated capacity, current, voltage and others | Cum/Soft | Lagrangian relaxation (III-A) | Two DRL strategies are designed to tackle intricate power system control challenges in a data-driven manner, aiming to preserve power system stability. |
| [104] | Transient and steady-state voltage control | Reactive power capacity constraints | Ins/Hard | Lagrangian relaxation (III-A) and barrier function (III-G) | Based on the safe gradient flow framework, the design employs a control barrier function to ensure that given dynamics never leave a safe set. |
| [105] | Frequency control | Operational constraints | Cum/Soft | Safety model (III-F) | A safety model is proposed comprising two parts: one to check if actions meet safety standards, and another to suggest new actions if they don’t. |
| [106] | Minimize the control cost | Frequency limit | Cum/Hard | Barrier function (III-G) | A novel self-tuning control barrier function is designed to actively compensate the unsafe frequency control strategies under variational safety constraints. |
| [107] | Primary frequency control | Frequency constraint | Ins/Hard | Gauge map (III-F) | A closed-form gauge map is proposed, which maps NN outputs from unsafe actions to the set of safe actions. |
| [108] | Frequency control | Operational safety constraints | Cum/Soft | Lagrangian relaxation (III-A) | Safety is considered during the action search process to ensure that various operational constraints are satisfied while the agent interacts with the environment. |
| [39] | Primary frequency control | Frequency stability constraints | Cum/Hard | Lyapunov method (III-C) | A Lyapunov function is integrated in the structural properties of controllers, guaranteeing local asymptotic stability. An RNN-based framework that incorporates frequency state transition dynamics is used to train controllers. |
| [109] | Wide-area damping control | System constraints | Ins/Hard | Bounded exploratory control | The agent uses DNN and DRL to identify and track the dynamics of the system and automatically takes actions to stabilize the system. |
| [110] | Minimize large frequency oscillations | Mean-variance risk measure | Cum/Soft | Lagrangian relaxation (III-A) | The risk-constrained linear quadratic regulator problem is addressed through dual reformulation into a minimax problem, utilizing an RL method. |
| [111] | FACTS setpoint control | Physical constraints | Cum/Soft | Lagrangian relaxation (III-A) | Model-based methods may underperform when faced with topology errors. RL improves by interacting with the environment, bypassing the need for updating network parameters. |

Power system stability control focuses on decision-making to prevent the system from entering undesired situations, especially to avert large catastrophic faults. Considering the sequence of control actions and contingencies, stability control is divided into two main categories: preventive and emergency control. Preventive security control aims to prepare the system while it is still in normal operation, ensuring it can satisfactorily handle future contingencies. In contrast, emergency control is initiated after contingencies have already occurred, with the objective of controlling the system’s dynamics to minimize consequences [112]. Both preventive and emergency control have strict timing requirements, with emergency control being even more time-sensitive, often requiring actions within tens of milliseconds.

From the perspective of key system variables that can indicate unstable behavior, traditional power system stability issues are classified into rotor angle stability, frequency stability, and voltage stability [113]. Considering the extensive integration of power electronic devices, power system stability issues have further expanded to include resonance stability and converter-driven stability [114]. Due to the complexity of stability issues and the rapidly changing system states, traditional analytical methods may struggle to find solutions and face computational efficiency limitations. However, RL and safe RL can efficiently address these challenges. The details of the applications of safe RL in stability control are shown in Table IV.

IV-C1 Frequency control

State

The state is the frequency $\omega$ and rotor angle $\delta$ :

\displaystyle\bm{s}^{\text{F}}\triangleq\left(\bm{\omega}_{t},\bm{\delta}_{t}\right)

(32)

Action

The control actions $\bm{a}^{\text{F}}$ are implemented through the control of active power injections:

\bm{a}^{\text{F}}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{Load}}_{t}\right)

(33)

Reward

The reward is to minimize the frequency deviation and control action cost:

R^{\text{F}}(\bm{s},\bm{a})=-\sum_{i\in\mathcal{N}}\left(\|\Delta\omega_{i}\|_{\infty}+\lambda h_{i}(u_{i})\right)

(34)

where $\|\Delta\omega_{i}\|_{\infty}$ represents the maximum frequency deviation during the time horizon; $h_{i}(u_{i})$ is a Lipschitz-continuous cost function of the control action $u_{i}$ ; the cost coefficient $\lambda$ is used to balance the cost of actions relative to the frequency deviations.
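As a minimal sketch of how (34) could be evaluated from a simulated episode, the function below takes one frequency-deviation trajectory and one control effort per node. The absolute value is used as the cost $h_{i}$ purely as an example of a Lipschitz-continuous function; the function name, the data layout, and the numbers are hypothetical.

```python
def frequency_reward(delta_omega, u, lam, cost=abs):
    """Eq. (34): negative sum of peak frequency deviation plus weighted action cost.

    delta_omega[i] is the trajectory of frequency deviations at node i,
    u[i] is the control effort at node i, and `cost` plays the role of h_i.
    """
    return -sum(max(abs(w) for w in traj) + lam * cost(ui)
                for traj, ui in zip(delta_omega, u))


trajectories = [[0.0, -0.2, 0.1], [0.05, 0.0]]  # illustrative deviations
efforts = [1.0, -2.0]                           # illustrative control efforts
print(frequency_reward(trajectories, efforts, lam=0.1))
```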

Constraint

The system frequency dynamics are given by the swing equation:


	$\displaystyle\dot{\delta}_{i}=\omega_{i}$	(35a)
$\displaystyle M_{i}\dot{\omega}_{i}=p^{\text{Bus}}_{i}-D_{i}\Delta\omega_{i}$	$\displaystyle-a^{\text{F}}_{i}(\omega_{i})-\sum_{j=1}^{n}B_{ij}\sin{(\Delta\delta)}$	(35b)

where $\dot{\delta}$ and $\dot{\omega}$ represent the time derivatives $d\delta/dt$ and $d\omega/dt$ , respectively; $\sum_{j=1}^{n}B_{ij}\sin(\Delta\delta)$ denotes the electrical power $p_{e,i}$ at each node $i$ ; the mechanical power $p_{m,i}$ is expressed as $p^{\text{Gen}}_{i}-\frac{\omega_{i}}{R_{i}}$ ; the bus power injection $p^{\text{Bus}}_{i}$ is defined as $p^{\text{Gen}}_{i}-p^{\text{Load}}_{i}$ . Other constraints are:


$\displaystyle\|p_{ij}\|$	$\displaystyle\leq\overline{p}_{ij}~{}~{}~{}\underline{\bm{a}^{\text{F}}}\leq\bm{a}^{\text{F}}(\bm{\omega})\leq\overline{\bm{a}^{\text{F}}}$	(36a)
	$\displaystyle\bm{a}^{\text{F}}(\bm{\omega})~{}\text{is stabilizing}$	(36b)

where the requirement that $\bm{a}^{\text{F}}(\bm{\omega})$ must be stabilizing is defined using various methods, such as Lyapunov stability [39].
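The role of the control term $a^{\text{F}}_{i}(\omega_{i})$ in (35b) can be seen in a single-machine sketch. It assumes one generator connected to an infinite bus at angle zero, treats $\omega$ as the frequency deviation, and uses a linear droop-like action $a(\omega)=k\omega$; the forward-Euler integrator, the controller form, and all parameter values are assumptions for illustration and are not taken from [39].

```python
import math


def simulate_swing(gain, p_bus=0.5, M=1.0, D=0.5, B=2.0, dt=0.01, T=40.0):
    """Forward-Euler rollout of Eq. (35) for one machine against an infinite bus."""
    delta, omega, peak = 0.0, 0.0, 0.0
    for _ in range(int(T / dt)):
        a = gain * omega                                     # hypothetical linear action
        d_omega = (p_bus - D * omega - a - B * math.sin(delta)) / M  # Eq. (35b)
        delta += dt * omega                                  # Eq. (35a), uses old omega
        omega += dt * d_omega
        peak = max(peak, abs(omega))
    return delta, omega, peak


for k in (0.0, 2.0):
    delta, omega, peak = simulate_swing(k)
    print(k, round(delta, 4), peak)
```

In this setup the extra damping supplied by the action reduces the peak frequency deviation after the power step, which is the quantity penalized by $\|\Delta\omega_{i}\|_{\infty}$ in (34).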

IV-D EV Charging Control

TABLE V: Safe RL Applications in EV Charging Control

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
| --- | --- | --- | --- | --- | --- |
| [115] | Minimize the EV charging cost | Constraints of action, entropy and SoC deviation | Cum/Soft | Lagrangian relaxation (III-A) | A model-free safe DRL algorithm is proposed to optimize real-time EV charging and discharging schedules without requiring accurate information on the arrival and departure times, remaining energy, and real-time electricity prices. |
| [116] | Energy management for plug-in hybrid EV | Physical constraints of components | Cum/Soft | Lagrangian relaxation (III-A) | By employing Lagrangian relaxation, the optimization for CMDP transforms into an unconstrained dual problem aimed at minimizing energy consumption. |
| [117] | Maximize the total profit | Limitations of power and demands | Cum/Soft | Lagrangian relaxation (III-A) | A detailed microgrid system is proposed, featuring a large CS, various EVs, V2G capabilities, and the non-linear charging behavior of EVs. |
| [118] | Maximize the revenue of electricity selling | EV charging constraint | Cum/Soft | Lagrangian relaxation (III-A) | The formulation takes into account the randomness of the EV’s arrival time, departure time, and remaining energy, as well as the real-time electricity price. |
| [34] | Smooth out the load profile of a parking lot | Constraints of EV charging and bound | Ins/Hard | Penalty function and projection method (III-B) | Two penalty functions are designed: one to ensure the system charges the EV with sufficient energy, and the other to check if an action exceeds the upper bound of the action space. |
| [26] | Optimal EV charging control | Constraints of EV | Ins/Hard | Lagrangian relaxation (III-A) and projection (III-B) | The primary objective is to optimize the distribution of power within network boundaries by effectively managing power generation, EVs, and ESSs. |
| [119] | Minimize the vehicle energy consumption | Constraints of battery power bound | Ins/Hard | Shielding method (III-E) | The shield transforms the agent’s desired action into a safe action for the environment. The desired action is only altered if it violates the safety rule embedded in the shield. |

The Paris Agreement recognizes EVs as a significant tool for reducing carbon emissions, leading to their widespread and vigorous development by countries globally. The global EV stock reached almost 30 million in 2022 and is expected to grow to about 240 million by 2030 in the stated policies scenario, an average annual growth rate of about 30%. Based on this trend, EVs will account for over 10% of the road vehicle fleet by 2030 [120]. However, the stochastic nature of EV charging can introduce unpredictable peak loads and voltage deviations in the power system. To address these issues, demand response for EVs has been proposed to mitigate grid peak loads and charging costs. Optimizing charging is further complicated by the need to factor in current electricity prices and the required charging energy. Additionally, the operation of certain EVs in V2G mode, enabling them to sell electricity back to the grid, adds another layer of complexity [121]. To tackle the uncertainty associated with EVs, RL and safe RL methods offer promising solutions to train effective charging strategies that achieve state-of-the-art performance [115]. Next, we describe the state, action, reward, and constraints of EV charging control.

State

The states include the SoC of EVs $\bm{SoC}^{\text{EV}}_{t}$ , the amount of charge the EVs require $\bm{p}^{\text{EV}}_{d,t}$ , the parking time of EVs $\bm{t}^{\text{EV}}_{p}$ , the electricity price for charging from the grid to the EVs $\bm{\Lambda}^{\text{EV}}_{\text{ch},t}$ , the electricity price for selling from the EVs to the grid $\bm{\Lambda}^{\text{EV}}_{\text{dis},t}$ , the power generated by the RESs $\bm{p}^{\text{RESs}}_{t}$ , and the load demand of other loads $\bm{p}^{\text{Load}}_{t}$ , which determines the state of the grid [117, 115]:

\bm{s}^{\text{EV}}_{t}\triangleq\left(\bm{SoC}^{\text{EV}}_{t},\bm{p}^{\text{EV}}_{d,t},\bm{t}^{\text{EV}}_{p},\bm{\Lambda}^{\text{EV}}_{\text{ch},t},\bm{\Lambda}^{\text{EV}}_{\text{dis},t},\bm{p}^{\text{RESs}}_{t},\bm{p}^{\text{Load}}_{t}\right)

(37)

Action

In existing research on EV charging management, the actions are primarily the charging power $\bm{p}^{\text{EV}}_{\text{ch},t}$ and discharging power $\bm{p}^{\text{EV}}_{\text{dis},t}$ [118, 117, 115]:

\bm{a}^{\text{EV}}_{t}\triangleq\left(\bm{p}^{\text{EV}}_{\text{ch},t},\bm{p}^{\text{EV}}_{\text{dis},t}\right)

(38)

Reward

The reward includes minimizing the charging cost associated with the time-varying electricity prices, maximizing the revenue from selling electricity from EVs back to the grid, and aligning the SoC closely with the target value [115, 117]:


$\displaystyle R^{\text{EV}}(\bm{s},\bm{a})$	$\displaystyle=-R^{\text{EV}}_{\text{cost}}+R^{\text{EV}}_{\text{rev}}-R^{\text{EV}}_{SoC}$	(39a)
$\displaystyle R^{\text{EV}}_{\text{cost}}$	$\displaystyle=\bm{\Lambda}^{\text{EV}}_{\text{ch},t}\bm{p}^{\text{EV}}_{\text{ch},t}$	(39b)
$\displaystyle R^{\text{EV}}_{\text{rev}}$	$\displaystyle=\bm{\Lambda}^{\text{EV}}_{\text{dis},t}\bm{p}^{\text{EV}}_{\text{dis},t}$	(39c)
$\displaystyle R^{\text{EV}}_{SoC}$	$\displaystyle=\|\bm{SoC}^{\text{EV}}_{t}-\bm{SoC}^{\text{EV}}_{\text{target}}\|,$	(39d)

where (39b), (39c), and (39d) represent the electricity charging cost, the electricity selling revenue, and the EV charging satisfaction term, respectively.

Constraint

Generally, EVs act as controllable loads within the electrical grid, with specific requirements for charging. When considering the V2G mode, the modeling of EVs is similar to that of BESS [117]:


	$\displaystyle 0\leq\bm{p}^{\text{EV}}_{\text{ch},t}\leq\overline{\bm{p}}^{\text{EV}}_{\text{ch}}~{}~{}~{}0\leq\bm{p}^{\text{EV}}_{\text{dis},t}\leq\overline{\bm{p}}^{\text{EV}}_{\text{dis}}$		(40a)
	$\displaystyle\underline{\bm{SoC}}^{\text{EV}}\leq\bm{SoC}^{\text{EV}}_{t}\leq\overline{\bm{SoC}}^{\text{EV}}$		(40b)
	$\displaystyle\bm{SoC}^{\text{EV}}_{t}=\bm{SoC}^{\text{EV}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{EV}}}\Big(\eta^{\text{EV}}_{\text{ch}}\bm{p}^{\text{EV}}_{\text{ch},t}-\frac{\bm{p}^{\text{EV}}_{\text{dis},t}}{\eta^{\text{EV}}_{\text{dis}}}\Big)$		(40c)

where (40a) and (40b) indicate the EV constraints on charging and discharging power and on SoC, respectively; (40c) represents the SoC update process of EVs. Also, most EVs require a target SoC at a specified time $t$ :

\bm{SoC}^{\text{EV}}_{t}\geq\bm{SoC}^{\text{EV}}_{\text{target}}

(41)
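The EV model in (39)-(40) can be condensed into a single environment step. The sketch below clips the requested powers to (40a), applies the SoC update (40c), reports whether the SoC limits (40b) hold, and evaluates the reward (39a) with the SoC term computed on the updated SoC. It treats a single EV with scalar quantities; the function name and every parameter value (battery capacity in kWh, power limits in kW, efficiencies, prices) are hypothetical.

```python
def ev_step(soc, p_ch, p_dis, price_ch, price_dis, soc_target,
            p_ch_max=7.0, p_dis_max=7.0, soc_min=0.1, soc_max=0.9,
            e_cap=50.0, eta_ch=0.95, eta_dis=0.95, dt=1.0):
    """One step of a single-EV model following Eqs. (39)-(40)."""
    # Eq. (40a): keep charging and discharging power within their limits.
    p_ch = min(max(p_ch, 0.0), p_ch_max)
    p_dis = min(max(p_dis, 0.0), p_dis_max)
    # Eq. (40c): SoC update with charging and discharging efficiencies.
    soc_next = soc + dt / e_cap * (eta_ch * p_ch - p_dis / eta_dis)
    # Eq. (40b): SoC limits, reported as a flag rather than enforced.
    feasible = soc_min <= soc_next <= soc_max
    # Eq. (39a): cost, revenue, and SoC-deviation terms.
    reward = -price_ch * p_ch + price_dis * p_dis - abs(soc_next - soc_target)
    return soc_next, reward, feasible


print(ev_step(0.5, 10.0, 0.0, price_ch=0.2, price_dis=0.1, soc_target=0.8))
print(ev_step(0.12, 0.0, 5.0, price_ch=0.2, price_dis=0.1, soc_target=0.8))
```

Returning the feasibility flag separately leaves the choice of safe RL technique open: it can feed a cost signal for Lagrangian relaxation, a penalty term, or a shield that replaces the action.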

A comprehensive review of the application of safe RL to EV charging control is provided in Table V. Most objectives focus on reducing EV charging costs, whereas in [34], the emphasis is on peak shaving and valley filling to smooth the electric net-load profile. In terms of specific safe RL techniques, most papers employ methods based on Lagrangian relaxation [115, 116, 117, 118, 26]. Exceptions are [34], which utilizes penalty functions, and [119], which adopts the shielding method.

IV-E Building Energy Management

TABLE VI: Safe RL Applications in Building Energy Management

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
| --- | --- | --- | --- | --- | --- |
| [122] | Tropical air free-cooled data center control | Constraints of temperature and humidity | Cum/Soft | Lagrangian relaxation (III-A) | By controlling the supply and exhaust fans, the cooling coil, and the dampers, the temperature and relative humidity of the air supplied to the servers are maintained below thresholds. |
| [123] | Dynamic thermal management in data center buildings | Constraints of equipment temperature | Cum+Ins/Hard+Soft | Lagrangian relaxation (III-A) and shielding (III-E) | Lagrangian-based constrained DRL and reward shaping are used to minimize soft violations. Parameterized shielding is employed to effectively avoid extreme temperature violations. |
| [48] | Data center building cooling | Constraints of zone temperature | Ins/Hard | Shielding method (III-E) | Shielding is avoided during training to not impede full exploration. An approach integrating empirical thermodynamics knowledge with data-driven models is proposed. |
| [124] | Multi-energy management of smart home | Constraints of components in the smart home | Cum/Soft | PDO (III-A) | By employing PDO, the Lagrangian relaxation coefficients for cost functions are automatically adjusted during the training, thereby minimizing both energy bills and the constraint costs. |
| [125] | District cooling system control | Power constraint | Ins/Hard | Safety layer (III-F) | A model-free DRL method is proposed that operates without needing an accurate system model or uncertainty distribution, utilizing a self-adaptive reward function to limit peak power. |
| [126] | Energy savings in building energy systems | Constraints of indoor temperature demand | Ins/Hard | Shielding method (III-E) | Implicit and explicit safety policies are combined through online residual learning, enabling real-time safety by filtering out unsafe actions, overcoming the limitations of relying solely on penalty-based rewards. |
| [127] | Safe building HVAC control | Constraints of building | Cum/Soft | Safety-aware objective | To ensure safe exploration, Gaussian noise is added to a hand-crafted rule-based controller. Adjusting the noise’s variance helps balance the diversity and safety. |
| [128] | Resilient proactive scheduling of building | Constraints of components of building | Cum/Soft | Adaptive reward | Conditional value-at-risk is used to handle uncertainties from extreme weather events, significantly reducing their impact on the learning process and achieving a balanced approach between exploration and exploitation. |
| [129] | Real-time control in a smart energy-hub | Physical constraints of energy hub | Cum/Soft | Safety-guided function | A safety-guided function calculates the action-value function based on accumulated safety, determining the trajectory’s safety under the current policy projected into the future. |
| [130] | Optimal dispatch of an energy hub | Constraints of energy balance and equipment | Cum/Soft | Primal-dual method (III-A) | The approach blends imitation learning for lower costs and primal-dual optimization to meet constraints, working better than using either method alone. |

In 2022, the global buildings sector was a major energy consumer, accounting for 30% of the final energy demand, primarily for operational needs like heating and cooling [131]. Energy hubs, connected to both the electric grid system and the natural gas network, cater to three types of energy demands: electrical, heating, and cooling, by controlling RESs, ESSs, EHPs, GBs, and HVAC systems [130]. Therefore, effective control of cooling or HVAC systems for buildings and energy hubs is necessary. Traditional cooling control relies on feedback control, whereas RL has the ability to self-learn and adapt in uncertain and complex environments, making it widely applied in recent years. Building energy management aims to minimize energy consumption while meeting the constraints of thermal-related equipment, such as HVAC, EHP, GB, and the demands for electricity and heat, as well as environmental constraints like temperature and humidity, as detailed in Table VI.

This review highlights models that demonstrate the integration of HVAC systems with power systems, particularly through safe RL controls. We explore the state, action, reward, and constraints associated with the RL control of HVAC and power systems, providing specific examples within the context of energy management in HVAC as follows:

State

The state of the building, in relation to HVAC systems, includes the indoor and outdoor temperatures $T^{I/O}$ , humidity $H$ , actual airflow rate $\bm{s}^{\text{air}}$ , and actual ventilation rate $\bm{s}^{\text{ven}}$ [132]. It also covers the states of the energy equipment, namely the BESS SoC $\bm{SoC}^{\text{BESS}}$ , TESS SoC $\bm{SoC}^{\text{TESS}}$ , CHP state $\bm{s}^{\text{CHP}}$ , GB state $\bm{s}^{\text{GB}}$ , and EHP state $\bm{s}^{\text{EHP}}$ ; the state of core operational equipment, such as the IT equipment temperature $T^{\text{IT}}$ ; human satisfaction indicators $\bm{s}^{\text{Human}}$ , such as a thermal comfort index; and exogenous states, such as the grid electricity price $\bm{\Lambda}^{\text{Ele}}$ , gas price $\bm{\Lambda}^{\text{Gas}}$ , and carbon price $\bm{\Lambda}^{\text{Car}}$ [126, 132, 129].

\bm{s}^{\text{Building}}_{t}\triangleq\left(T^{I},T^{O},H,\bm{s}^{\text{air}},\bm{s}^{\text{ven}},\bm{SoC}^{\text{BESS}},\bm{SoC}^{\text{TESS}},\bm{s}^{\text{CHP}},\bm{s}^{\text{GB}},\bm{s}^{\text{EHP}},T^{\text{IT}},\bm{s}^{\text{Human}},\bm{\Lambda}^{\text{Ele}},\bm{\Lambda}^{\text{Gas}},\bm{\Lambda}^{\text{Car}}\right)

(42)

Action

Building energy management for HVAC is primarily achieved through the management of energy control equipment, including temperature setpoint $T_{\text{set}}$ , humidity setpoint $H_{\text{set}}$ , airflow rate $\bm{a}^{\text{air}}$ , ventilation rate $\bm{a}^{\text{ven}}$ , BESS charge or discharge amount $\bm{p}^{\text{BESS}}_{ch/dis}$ , TESS charge or discharge amount $\bm{h}^{\text{TESS}}_{ch/dis}$ , electricity generated by CHP $\bm{p}^{\text{CHP}}$ , heat generated by CHP $\bm{h}^{\text{CHP}}$ , GB $\bm{h}^{\text{GB}}$ and EHP $\bm{h}^{\text{EHP}}$ , and RESs output $\bm{p}^{\text{RES}}$ [127].

\bm{a}^{\text{Building}}_{t}\triangleq\left(T_{\text{set}},H_{\text{set}},\bm{a}^{\text{air}},\bm{a}^{\text{ven}},\bm{p}^{\text{BESS}}_{\text{ch}/\text{dis}},\bm{h}^{\text{TESS}}_{\text{ch}/\text{dis}},\bm{p}^{\text{CHP}},\bm{h}^{\text{CHP}},\bm{h}^{\text{GB}},\bm{h}^{\text{EHP}},\bm{p}^{\text{RES}}\right)

(43)

Reward

The reward is to minimize the total energy cost, such as the cost of electricity, natural gas, heat, and device long-term degradation, especially for BESSs and TESSs. For some research papers that require specific room temperature ranges, temperature deviations are often included in the reward calculations.

R^{\text{Building}}(\bm{s},\bm{a})=-(R_{\text{cost}}+R_{\text{degrade}}+\Delta T)

(44)

where the three components represent the energy cost, the device degradation cost, and the temperature deviation, respectively.

Constraint

The generation and consumption of electrical and thermal energy are equal, complying with the electrical and thermal balance equations [129, 124].


	$\displaystyle\bm{p}^{\text{Grid}}_{t}+\bm{p}^{\text{RESs}}_{t}+\bm{p}^{\text{BESS}}_{\text{dis},t}+\bm{p}^{\text{CHP}}_{t}=\bm{p}^{\text{HVAC}}_{t}+\bm{p}^{\text{Load}}_{t}+\bm{p}^{\text{EV}}_{t}+\bm{p}^{\text{BESS}}_{\text{ch},t}+\bm{p}^{\text{EHP}}_{t}$		(45a)
	$\displaystyle\bm{h}^{\text{CHP}}_{t}+\bm{h}^{\text{GB}}_{t}+\bm{h}^{\text{TESS}}_{\text{dis},t}+\bm{h}^{\text{EHP}}_{t}=\bm{h}^{\text{TL}}_{t}+\bm{h}^{\text{TESS}}_{\text{ch},t}$		(45b)

The constraints of BESS have already been shown in (20). The constraints of TESS are similar to BESS:


	$\displaystyle\bm{0}\leq\bm{h}^{\text{TESS}}_{\text{ch},t}\leq\overline{\bm{h}}^{\text{TESS}}_{\text{ch}}~{}~{}~{}\bm{0}\leq\bm{h}^{\text{TESS}}_{\text{dis},t}\leq\overline{\bm{h}}^{\text{TESS}}_{\text{dis}}$		(46a)
	$\displaystyle\underline{\bm{SoC}}^{\text{TESS}}\leq\bm{SoC}^{\text{TESS}}_{t}\leq\overline{\bm{SoC}}^{\text{TESS}}$		(46b)
	$\displaystyle\bm{SoC}^{\text{TESS}}_{t}=\bm{SoC}^{\text{TESS}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{TESS}}}\Big(\eta^{\text{TESS}}_{\text{ch}}\bm{h}^{\text{TESS}}_{\text{ch},t}-\frac{\bm{h}^{\text{TESS}}_{\text{dis},t}}{\eta^{\text{TESS}}_{\text{dis}}}\Big)$		(46c)

CHP is a single-input-multi-output converter with high electrical and thermal energy efficiency, and its constraints are as follows [129]:


$\displaystyle\bm{p}^{\text{CHP}}_{t}=\eta^{\text{CHP}}_{p}\bm{g}^{\text{CHP}}_{t}~{}~{}~{}$	$\displaystyle\bm{h}^{\text{CHP}}_{t}=\eta^{\text{CHP}}_{h}\bm{g}^{\text{CHP}}_{t}$	(47a)
$\displaystyle\bm{0}\leq\bm{p}^{\text{CHP}}_{t}\leq\overline{\bm{p}}^{\text{CHP}}~{}~{}~{}$	$\displaystyle\bm{0}\leq\bm{h}^{\text{CHP}}_{t}\leq\overline{\bm{h}}^{\text{CHP}}$	(47b)

where (47a) indicates the efficiency of converting the natural gas input $\bm{g}^{\text{CHP}}_{t}$ into electric power $\bm{p}^{\text{CHP}}_{t}$ and heat power $\bm{h}^{\text{CHP}}_{t}$ ; (47b) represents the range of $\bm{p}^{\text{CHP}}_{t}$ and $\bm{h}^{\text{CHP}}_{t}$ .

GB and EHP respectively convert natural gas and electricity into heat to meet the heating demand, which can be represented as follows [130]:


$\displaystyle\bm{h}^{\text{GB}}_{t}=\eta^{\text{GB}}\bm{g}^{\text{GB}}_{t}~{}~{}~{}$	$\displaystyle\bm{h}^{\text{EHP}}_{t}=\eta^{\text{EHP}}\bm{p}^{\text{EHP}}_{t}$	(48a)
$\displaystyle\bm{0}\leq\bm{h}^{\text{GB}}_{t}\leq\overline{\bm{h}}^{\text{GB}}~{}~{}~{}$	$\displaystyle\bm{0}\leq\bm{h}^{\text{EHP}}_{t}\leq\overline{\bm{h}}^{\text{EHP}}$	(48b)

where (48a) indicates the conversion of natural gas and electricity to heat with different efficiencies; (48b) is the range of $\bm{h}^{\text{GB}}_{t}$ and $\bm{h}^{\text{EHP}}_{t}$ .
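The conversion relations (47a) and (48a) and the balance equations (45) can be combined into a simple feasibility check for a candidate dispatch. The sketch below omits storage, HVAC, and EV terms for brevity and solves (45a) for the grid import, so it is a reduced illustration rather than the full energy hub model; the function name and all efficiency values are hypothetical.

```python
def hub_balance(g_chp, g_gb, p_ehp, p_load, h_load, p_res=0.0,
                eta_p=0.35, eta_h=0.45, eta_gb=0.9, eta_ehp=3.0):
    """Grid import and heat mismatch for a dispatch, without storage, HVAC, or EVs."""
    p_chp = eta_p * g_chp          # Eq. (47a): gas to electricity
    h_chp = eta_h * g_chp          # Eq. (47a): gas to heat
    h_gb = eta_gb * g_gb           # Eq. (48a): gas boiler
    h_ehp = eta_ehp * p_ehp        # Eq. (48a): electric heat pump
    # Eq. (45a) solved for the grid import with the omitted terms set to zero.
    p_grid = p_load + p_ehp - p_res - p_chp
    # Eq. (45b): residual of the heat balance (zero means balanced).
    heat_mismatch = h_chp + h_gb + h_ehp - h_load
    return p_grid, heat_mismatch


print(hub_balance(g_chp=10.0, g_gb=2.0, p_ehp=1.0, p_load=5.0, h_load=9.3))
```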

HVAC is an important tool for monitoring and controlling the indoor temperature to keep it within the required range [124, 128]:


	$\displaystyle T^{I}_{t}=\epsilon T^{I}_{t-1}+(1-\epsilon)\left(T^{O}_{t-1}-\frac{\eta^{\text{HVAC}}E^{\text{HVAC}}_{t-1}}{A}\right)$		(49a)
	$\displaystyle\underline{E}^{\text{HVAC}}\leq E^{\text{HVAC}}_{t}\leq\overline{E}^{\text{HVAC}}~{}~{}~{}\underline{T}^{I}\leq T^{I}_{t}\leq\overline{T}^{I}$		(49b)

where $E^{\text{HVAC}}$ denotes the energy consumption of HVAC; (49a) indicates the temperature change of the room; (49b) represents the limits of HVAC energy consumption $E^{\text{HVAC}}_{t}$ and indoor temperature $T^{I}_{t}$ .
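The first-order thermal model (49a) is straightforward to simulate, which makes it easy to count violations of the comfort band in (49b) for a given HVAC schedule. The sketch below holds the outdoor temperature and the HVAC energy constant; the parameter values for $\epsilon$ , $\eta^{\text{HVAC}}$ , and $A$ , the comfort band, and the function names are hypothetical.

```python
def indoor_temp_step(t_in, t_out, e_hvac, eps=0.7, eta=2.0, a=0.25):
    """One step of Eq. (49a) in cooling mode."""
    return eps * t_in + (1.0 - eps) * (t_out - eta * e_hvac / a)


def simulate_room(t_in, t_out, e_hvac, steps, t_min=20.0, t_max=26.0):
    """Roll out Eq. (49a) and count violations of the temperature band in (49b)."""
    violations = 0
    for _ in range(steps):
        t_in = indoor_temp_step(t_in, t_out, e_hvac)
        if not (t_min <= t_in <= t_max):
            violations += 1
    return t_in, violations


final_temp, violations = simulate_room(t_in=30.0, t_out=32.0, e_hvac=1.0, steps=200)
print(final_temp, violations)
```

With constant inputs the indoor temperature converges to $T^{O}-\eta^{\text{HVAC}}E^{\text{HVAC}}/A$ , so the violations occur only during the initial transient; a count of this kind can serve as a cumulative constraint cost in a CMDP formulation.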

IV-F Other Control Areas

TABLE VII: Safe RL Applications in Other Control Areas

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
| --- | --- | --- | --- | --- | --- |
| [133] | Optimal scheduling of EV aggregators | Constraints of EVs and driver’s energy demand | Cum/Soft | Lagrangian relaxation (III-A) | An L2 norm penalty term is added to form an augmented Lagrangian function, which enhances the convexity and tractability of the CMDP. |
| [134] | V2G market | Constraints of maximum incentive | Cum/Soft | Primal-dual theories (III-A) | This is the first model-free learning algorithm designed to optimize incentives without knowing how EV users will react. It simultaneously improves load control and user satisfaction. |
| [135] | Pricing strategy for real-time congestion management | Constraints of CS, operator, and grid | Cum/Soft | Adaptive constraint cost | An adaptive scalability factor is introduced to balance safety and exploration. Then, a constrained cross-entropy method is employed to solve this pricing problem within a continuous action space. |
| [53] | Service restoration | Power flow and voltage constraints | Cum+Ins/Hard | Safety layer (III-F) and penalty term | Imitation learning is utilized to ensure acceptable initial performance. Action clipping, reward shaping, and expert demonstrations are employed to guarantee safe exploration. |
| [136] | Critical load restoration | Constraints of loads, DERs, ESSs | Cum/Soft | Primal-dual differentiable programming (III-A) | Compared to the traditional RL that uses arbitrarily large unit penalties, the proposed method can achieve better performance, evidenced by a higher objective value. |
| [49] | Unit commitment | Constraints of scheduling | Ins/Hard | Clipping (III-E) | Clipping of the action space is performed to ensure that uncertainty estimates are reasonable and within appropriate bounds, which are derived from historical data. |
| [137] | Reserve scheduling | Constraints of voltage, RESs, tie line, and ESSs | Cum/Soft | Primal-dual method (III-A) | ESS is fully utilized through more accurate intraday operation scenario simulations to enhance the system’s peak management and flexibility, reducing the reserve requirements of the main network. |

In this section, the applications of safe RL in the electricity market, system restoration, and unit commitment and reserve scheduling are summarized, as detailed in Table VII. The specific state, action, reward, and constraints for each area are presented as follows:

IV-F1 Electricity Market

Electricity markets can promote the participation of users in the grid through dynamic pricing and incentive measures to balance supply and demand, thereby enhancing overall energy efficiency. [135] employs safe RL to formulate dynamic pricing strategies for controlling shiftable loads such as EVs and HVAC systems. While some have used NNs to predict the optimal marginal prices of the OPF, such as in [138], these approaches do not derive a stochastic policy. In this section, although EVs are still involved, we mainly focus on aspects related to pricing and DSO operational costs, whereas Section IV-D primarily addresses the OPF that includes EVs. The state, action, reward, and constraints of electricity markets are shown as follows [133, 135, 134]:

State

The state is the observed status information of the CSs and the DSO, including the total cost of EV CSs $\bm{s}_{\text{cost}}^{\text{CS}}$ and the total cost of the DSO $\bm{s}_{\text{cost}}^{\text{DSO}}$ .

\bm{s}^{\text{Market}}_{t}\triangleq\left(\bm{s}_{\text{cost}}^{\text{CS}},\bm{s}_{\text{cost}}^{\text{DSO}}\right)

(50)

Action

The action denotes the incentive electricity price of different EV CSs $\bm{\Lambda}^{\text{CS}}$ .

\bm{a}^{\text{Market}}_{t}\triangleq\left(\bm{\Lambda}^{\text{CS}}\right)

(51)

Reward

The reward is to minimize the cost of EV users and maximize the profits of CSs and DSOs by setting different electricity prices.

R^{\text{Market}}(\bm{s},\bm{a})=-R^{\text{User}}+R^{\text{CS}}+R^{\text{DSO}}

(52)

Constraint

The EV model has been shown in Section IV-D.

IV-F2 System Restoration

System restoration refers to the process of swiftly recovering load from an impacted state to normal operation following the occurrence of extreme events. [53, 136] generate system restoration strategies through the use of safe RL, either by controlling local DERs or by transferring load to safe areas. The state, action, reward, and constraints of system restoration are shown as follows:

State

The state includes the forecast of future renewable energy output $\bm{p}_{t+1}^{\text{RES}}$ , past restored loads $\bm{p}_{t-1}^{\text{Load}}$ , current SoC of the BESSs $\bm{SoC}_{t}^{\text{BESS}}$ , and remaining reserves of various types of generators $\overline{\bm{p}_{t}}^{\text{Gen}}-\bm{p}_{t}^{\text{Gen}}$ .

\bm{s}^{\text{Restoration}}_{t}\triangleq\left(\bm{p}_{t+1}^{\text{RES}},\bm{p}_{t-1}^{\text{Load}},\bm{SoC}_{t}^{\text{BESS}},\overline{\bm{p}_{t}}^{\text{Gen}}-\bm{p}_{t}^{\text{Gen}}\right)

(53)

Action

The action includes the restored load $\bm{p}_{\text{restored},t}^{\text{Load}}$ , active power output of all kinds of generators $\bm{p}_{t}^{\text{Gen}}$ and BESSs $\bm{p}_{t}^{\text{BESS}}$ .

\bm{a}^{\text{Restoration}}_{t}\triangleq\left(\bm{p}_{\text{restored},t}^{\text{Load}},\bm{p}_{t}^{\text{Gen}},\bm{p}_{t}^{\text{BESS}}\right)

(54)

Reward

The reward is to maximize the sum of restored loads $\sum\bm{p}_{\text{restored},t}^{\text{Load}}$ .

R^{\text{Restoration}}(\bm{s},\bm{a})=\sum\bm{p}_{\text{restored},t}^{\text{Load}}

(55)

Constraint

System restoration requires adherence to fundamental power system operational constraints and equipment constraints, including AC-PF constraints (15), DC-PF constraints (19), BESSs constraints (20), etc., all of which have been detailed above. In addition, it is necessary to add constraints to ensure that the load is restored monotonically:

\bm{p}_{\text{restored},t}^{\text{Load}}\leq\bm{p}_{\text{restored},t+1}^{\text{Load}}

(56)
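
As an illustration of how the monotonicity requirement (56) could be imposed on a learned policy, the following minimal sketch clips a proposed restoration trajectory so that it never decreases. It is only one simple possibility and is not taken from [53, 136]; the function name, the scalar (aggregate) load representation, and the upper bound `total_load` are assumptions made for the example.

```python
from typing import List


def enforce_monotonic_restoration(proposed: List[float], total_load: float) -> List[float]:
    """Clip a proposed restored-load trajectory so it never decreases, cf. (56).

    proposed[t] is the aggregate restored load suggested by the policy at step t
    (hypothetical scalar); total_load is an assumed cap on the restorable load.
    """
    safe: List[float] = []
    previous = 0.0
    for p in proposed:
        # Keep the value inside [previous, total_load] so restoration is monotone.
        p_safe = min(max(p, previous), total_load)
        safe.append(p_safe)
        previous = p_safe
    return safe


if __name__ == "__main__":
    # The dip from 10 to 8 is removed and the last value is capped at 25.
    print(enforce_monotonic_restoration([10.0, 8.0, 15.0, 30.0], total_load=25.0))
```

Such a clipping step only handles (56); the network and equipment constraints listed above would still need to be enforced by the safe RL method itself.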

IV-F3 Unit Commitment and Reserve Scheduling

Unit commitment and reserve scheduling are both conducted in the day-ahead market, taking into account future uncertainties, such as those from loads and RESs. [49, 137] utilize safe RL to generate strategies for unit commitment, as well as coordinated strategies for tie-line reserve and energy storage, respectively. The state, action, reward, and constraints of unit commitment and reserve scheduling are shown as follows:

State

The state comprises the historical and current net load forecasts $P^{\text{Load}}_{\text{his/pre}}$, together with the commitment, start-up, and shut-down decisions at the previous stage:

\bm{s}^{\text{Reserve}}_{t}\triangleq\left(P^{\text{Load}}_{\text{his}},P^{\text{Load}}_{\text{pre}},\bm{u}_{\text{start},t-1},\bm{u}_{\text{shut},t-1},\bm{u}_{\text{com},t-1}\right)

(57)

Action

The action includes the current commitment, start-up, and shut-down decisions $\bm{u}_{\text{start/shut/com},t}$ and the power output of the generators $\bm{p}^{\text{Gen}}_{t}$:

\bm{a}^{\text{Reserve}}_{t}\triangleq\left(\bm{u}_{\text{start},t},\bm{u}_{\text{shut},t},\bm{u}_{\text{com},t},\bm{p}^{\text{Gen}}_{t}\right)

(58)

Reward

The reward is to minimize the overall costs, including the cost of power generation $R^{\text{Gen}}_{\text{cost}}$ , commitment costs $R^{\text{Commitment}}_{\text{cost}}$ , and start-up and shut-down costs $R^{\text{Start/Shut}}_{\text{cost}}$ :

R^{\text{Reserve}}(\bm{s},\bm{a})=-(R^{\text{Gen}}_{\text{cost}}+R^{\text{Commitment}}_{\text{cost}}+R^{\text{Start}}_{\text{cost}}+R^{\text{Shut}}_{\text{cost}})

(59)

Constraint


	$\displaystyle u_{\text{com},i,t}\underline{p}^{\text{Gen}}_{i}\leq p^{\text{Gen}}_{i,t}\leq u_{\text{com},i,t}\overline{p}^{\text{Gen}}_{i},\quad\forall i\in\mathcal{G}$		(60a)
	$\displaystyle\sum_{\xi=t-\underline{t}_{\text{up},i}+1}^{t}u_{\text{start},i,\xi}\leq u_{\text{com},i,t},\quad\forall i\in\mathcal{G},\,t\in\{\underline{t}_{\text{up},i},\ldots,t_{\text{tot}}\}$		(60b)
	$\displaystyle\sum_{\xi=t-\overline{t}_{\text{up},i}+1}^{t}u_{\text{shut},i,\xi}\leq 1-u_{\text{com},i,t},\quad\forall i\in\mathcal{G},\,t\in\{\overline{t}_{\text{up},i},\ldots,t_{\text{tot}}\}$		(60c)
	$\displaystyle u_{\text{com},i,t}-u_{\text{com},i,t-1}=u_{\text{start},i,t}-u_{\text{shut},i,t},\quad\forall i\in\mathcal{G}$		(60d)
	$\displaystyle u_{\text{start},i,t}+u_{\text{shut},i,t}\leq 1,\quad\forall i\in\mathcal{G}$		(60e)
	$\displaystyle\sum_{i\in\mathcal{G}}\bm{p}^{\text{Gen}}_{i,t}\leq P^{\text{Load}}_{\text{pre},t},\quad\sum_{i\in\mathcal{G}}u_{\text{com},i,t}\overline{\bm{p}}^{\text{Gen}}_{e,i}\geq P_{\text{res},t}$		(60f)
	$\displaystyle\bm{p}^{\text{Gen}}_{t}-\bm{p}^{\text{Gen}}_{t-1}\leq\bm{R}_{\text{up},t-1}\bm{u}_{\text{com},t-1}+\bm{S}_{\text{up},t}\bm{u}_{\text{start},t}$		(60g)
	$\displaystyle\bm{p}^{\text{Gen}}_{t-1}-\bm{p}^{\text{Gen}}_{t}\leq\bm{R}_{\text{down},t}\bm{u}_{\text{com},t}+\bm{S}_{\text{down},t}\bm{u}_{\text{shut},t}$		(60h)
	$\displaystyle u_{\text{start},i,t},u_{\text{shut},i,t},u_{\text{com},i,t}\in\{0,1\},\quad\forall i\in\mathcal{G}$		(60i)

where (60a) indicates generator limits; (60b) and (60c) represent minimum up-time and down-time constraints; (60d) and (60e) denote the logical relationship between the generator commitment decisions and start-up/shut-down decisions; (60f) indicates the power generation and reserve constraints; (60g) and (60h) represent ramp-up and ramp-down limits of generators; (60i) specifies the integrality requirement of commitment and start-up/shut-down decisions.
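
To make the single-unit constraints concrete, the sketch below checks a candidate schedule of one generator against (60a), (60d), (60e), and (60i). It is an illustrative feasibility check under stated assumptions, not the method of [49, 137]: the function name, the list-based schedule format, and the initial commitment status `u_com_initial` are hypothetical, and the coupling constraints (60b), (60c), (60f), (60g), and (60h) are deliberately left out.

```python
from typing import List


def check_unit_schedule(u_com: List[int], u_start: List[int], u_shut: List[int],
                        p_gen: List[float], p_min: float, p_max: float,
                        u_com_initial: int = 0) -> List[str]:
    """Return the violated constraints of one generator over the horizon."""
    violations: List[str] = []
    previous = u_com_initial
    for t, (c, s, d, p) in enumerate(zip(u_com, u_start, u_shut, p_gen)):
        # (60i): all commitment-related decisions are binary.
        if any(v not in (0, 1) for v in (c, s, d)):
            violations.append(f"t={t}: (60i) non-binary decision")
        # (60a): output lies within its limits when committed and is zero otherwise.
        if not (c * p_min <= p <= c * p_max):
            violations.append(f"t={t}: (60a) generation limit")
        # (60d): the change in commitment equals start-up minus shut-down.
        if c - previous != s - d:
            violations.append(f"t={t}: (60d) commitment logic")
        # (60e): a unit cannot start up and shut down in the same period.
        if s + d > 1:
            violations.append(f"t={t}: (60e) simultaneous start-up and shut-down")
        previous = c
    return violations


if __name__ == "__main__":
    # A unit that starts at t=0, runs for two periods, and shuts down at t=2.
    print(check_unit_schedule([1, 1, 0], [1, 0, 0], [0, 0, 1],
                              [50.0, 80.0, 0.0], p_min=40.0, p_max=100.0))
```

A check of this kind could be used to label an action as safe or unsafe before it is executed, while the safe RL algorithm decides how to correct or penalize the violations.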

Regarding RRL, some researchers have begun to study game-theoretic RL for multistage games (also referred to as dynamic games) between attackers and defenders. This RL-based method aims to identify optimal attack sequences in pursuit of certain objectives [139, 140] and dynamic internal trading price strategies [141, 142]. Although chance-constrained RRL methods have garnered attention in automatic control [62, 61, 143], and several researchers have explored robust optimization and machine learning for power flow control [144, 145, 146], chance-constrained RRL for power system control and optimization remains underexplored.

V Challenges and Outlook

The application of safe RL in power systems is still in its infancy and faces a variety of challenges, namely scalability, distributed settings, and industrial deployment. We also discuss potential future research directions.

V-A Challenges in Safe RL

Although the general challenges of RL have already been reviewed in [1], this subsection will explore the unique challenges faced by existing safe RL approaches.

V-A1 Scalability

Real-world power systems encompass a vast number of buses and power lines. For instance, the Eastern Interconnection, a major North American power grid system, has been modeled with over 60,000 buses in certain simulations. Consequently, large-scale multi-agent systems face scalability issues in such environments for two primary reasons. First, the state and action spaces expand dramatically with an increasing number of agents, a phenomenon known as the “curse of dimensionality.” This expansion results in an exponentially growing search space for optimal actions. Second, as the number of buses grows, the number of power flow constraints and other hard physical constraints increases rapidly. Additionally, some studies account for security constraints arising from demand uncertainty in power systems, which further complicates the constraints in the RL training process. These factors make it challenging for safe RL to converge to feasible results using stochastic gradient descent methods. Two lines of work may help. One is the use of factored action spaces, which decomposes the action space into smaller, manageable components [147]. This approach has been applied successfully in complex environments like StarCraft and Dota 2, showing significant versatility and efficiency in handling combinatorial and continuous control problems. The other approximates complex constraints with reduced-order polytopal constraints and low-order elliptical constraints [148], which offers the potential for effectively incorporating extensive constraints in safe RL.
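
The benefit of factoring can be seen from a simple count. The sketch below assumes a hypothetical setting with `num_devices` controllable devices, each with `settings_per_device` discrete settings, and compares the number of joint actions with the number of outputs of a policy that uses one small action head per device. It illustrates the general idea only and does not reproduce the specific construction in [147].

```python
def joint_action_count(num_devices: int, settings_per_device: int) -> int:
    """Number of joint actions when every combination is a separate action."""
    return settings_per_device ** num_devices


def factored_output_count(num_devices: int, settings_per_device: int) -> int:
    """Number of policy outputs when each device has its own small action head."""
    return num_devices * settings_per_device


def decode_joint_action(index: int, num_devices: int, settings_per_device: int) -> list:
    """Map a flat joint-action index to one setting per device (mixed-radix decoding)."""
    settings = []
    for _ in range(num_devices):
        settings.append(index % settings_per_device)
        index //= settings_per_device
    return settings


if __name__ == "__main__":
    # Ten devices with five settings each: 5**10 joint actions versus 50 outputs.
    print(joint_action_count(10, 5), factored_output_count(10, 5))
    print(decode_joint_action(5, num_devices=3, settings_per_device=2))
```

The joint count grows exponentially in the number of devices, whereas the factored count grows linearly, at the price of having to coordinate the per-device components so that network-wide constraints remain satisfied.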

V-A2 Distributed Setting

In parallel, improvements in distributed systems and their algorithms have been essential to the rise of deep learning. Researchers have made a number of advances in creating multi-agent versions of learning algorithms and in developing distributed deep learning systems [149]. These methods have made it possible to scale up training procedures for very large-scale systems. This motivates the adoption of distributed structures for DRL, which let agents converge quickly and explore and learn many different tasks at the same time.

A unique aspect of RL is that agents actively shape the learning process by interacting with their environment and keeping a record of what they experience. DRL therefore uses distributed approaches to generate more learning data in a shorter time and to handle multiple learning processes simultaneously. Distributed DRL has been applied to complex power system tasks such as load scheduling [150] and the management of EVs [151]. What makes safe RL more challenging in a distributed setting is the need to decompose complex network constraints into smaller, manageable segments. The new challenge is ensuring that distributed safe DRL can reach a consensus on how to split these problems and converge on a solution while satisfying safety constraints throughout the learning process.

V-A3 Industrial Deployment

Current safe RL strategies largely rely on model-based approaches or training on historical data, which present significant challenges upon deployment in industrial settings. Among all the papers covered in this review, only [126] interacts with a real building to train a safe RL model for temperature control, which is possible because of its low deployment risk. In contrast, most studies related to the power grid have not led to technologies used in practical deployment because of safety concerns, and they often resort to model-based or data-based methods. The concern is that these methods, while effective in simulated environments or on historical data, may not fully capture the complex, dynamic, and uncertain nature of real-world industrial processes. The discrepancies between simulated environments and actual operational conditions can lead to unexpected behaviors or safety violations, as the learned policies may not generalize well to unseen situations. Furthermore, reliance on historical data limits the system’s ability to adapt to novel conditions or operational changes that were not represented in the training set. This necessitates the development of adaptive, robust, and transferable safe RL algorithms that can continuously learn and adjust to new data in real time, ensuring safety and efficiency in the face of the evolving operational dynamics characteristic of industrial environments.

V-B Future Directions in Safe RL

Regarding the challenges in applying safe RL to power systems, we present several potential future directions below.

V-B1 Exploring Offline Safe RL

DRL algorithms are based on an online learning paradigm, which presents a significant hurdle to their widespread adoption in power systems. In general, such online interaction is not practical because of the expense (e.g., in robotics, educational agents, or healthcare) and risk (e.g., in autonomous driving, power systems, or healthcare) associated with exploring control actions in a safety-critical system [152]. Even in domains where online interaction is viable, leveraging previously collected data is often preferable, especially in complex domains where effective generalization necessitates extensive datasets.

Safe RL endeavors to achieve a policy that maximizes rewards within defined constraints, demonstrating advantages in meeting safety requirements for real-world applications. Nonetheless, many deep safe RL approaches primarily address safety post-training, neglecting the costs associated with constraint violations during the training phase. The necessity of collecting online interaction samples poses challenges in ensuring training safety, as preventing the agent from executing unsafe behaviors during learning is non-trivial [153]. Although carefully designed correction systems or human interventions can serve as safety mechanisms to filter unsafe actions during training, their application may prove costly due to the low sample efficiency of many RL approaches.

It is worth adding that using a simulation environment as a digital twin for training is reasonable. Although discrepancies between simulations and real conditions are unavoidable, high-fidelity simulations and model-based numerical optimization are already the nuts and bolts of energy management systems and guide the control actions used to manage the grid today. If these models are accurate enough for today's decision systems to select control actions optimally, then it is reasonable to assume that they are sufficiently accurate to train optimal policies. This remains an important research question, since at the moment there is no comprehensive characterization of how the discrepancies between simulated and real environments affect performance and safety.

V-B2 Emphasizing Privacy in the Learning Process

As RL algorithms grow in popularity, so too do concerns about their privacy implications. The value or policy functions released are trained using reward signals and other inputs that often depend on sensitive data. In the domain of power systems, some rewards could inadvertently expose critical measurement data, such as voltage phasors and power demands, which in turn could lead to issues like false data injection. This historical data can potentially be deduced by recursively querying the released functions. One potential research direction is the development of differentially private algorithms for RL, which safeguard reward information from being compromised by techniques such as inverse RL [154]. The issue of privacy becomes even more critical in the offline RL setting, which is arguably more relevant for applications handling sensitive data. For example, in the EV charging domain, online RL necessitates the continual execution of new exploratory policies for each arriving EV, involving sensitive data like arrival and departure times. In contrast, offline RL relies on historical data of EV charging behavior, which can be particularly sensitive [155]. However, these differentially private mechanisms could introduce uncertainty into safety constraints. Concurrently, differentially private AC-PF constrained OPF has been explored, with studies formulating it as robust optimization to ensure the feasibility of these safety constraints [156]. One potential approach is to develop robust formulation training for safe DRL.
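
As a minimal illustration of a differentially private release, the sketch below applies the standard Laplace mechanism to a scalar quantity such as a reward that depends on a sensitive measurement. It is a generic textbook mechanism, not the algorithm of [154] or [156]; the function names and the example sensitivity and privacy budget `epsilon` are assumptions made for the example.

```python
import random


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Noise scale b = sensitivity / epsilon of the Laplace mechanism."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return sensitivity / epsilon


def privatize(value: float, sensitivity: float, epsilon: float,
              rng: random.Random) -> float:
    """Release value plus zero-mean Laplace noise with scale b."""
    b = laplace_scale(sensitivity, epsilon)
    # The difference of two i.i.d. exponential variables with mean b is Laplace(0, b).
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return value + noise


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical reward derived from a load measurement; a smaller epsilon adds more noise.
    print(privatize(10.0, sensitivity=1.0, epsilon=0.5, rng=rng))
```

The added noise is exactly the source of the uncertainty in safety constraints noted above: a constraint evaluated on the perturbed quantity may appear satisfied when the true one is violated, which is why a robust formulation is a natural companion.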

V-B3 Integrating Federated Learning Mechanism

To simultaneously address privacy and scalability issues, integrating federated learning into safe DRL could be a viable solution. In practical scenarios, RL faces challenges such as poor agent performance in large action and state spaces due to limited sample exploration and low sample efficiency impacting learning speed. Information exchange between agents can significantly boost learning rates. While distributed and parallel RL algorithms address these issues by centralizing data, parameters, or gradients for model training, this centralization can compromise privacy, leading to agent mistrust and data interception risks [157].

Federated learning (FL), however, enables information exchange without compromising privacy, helping agents adapt to diverse environments. It also addresses the simulation-reality gap often present in RL; while many RL algorithms depend on pre-training in simulated environments that do not perfectly mirror the real world, FL can amalgamate insights from both to more accurately bridge this gap [158]. Additionally, FL is beneficial when agents only observe partial features, enabling effective aggregation of this limited information. These considerations give rise to the idea of federated safe RL, which merges FL and safe RL within a privacy-preserving framework, adapting safe RL strategies for sequential decision-making tasks.
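
The aggregation step at the core of FL can be sketched as a sample-weighted average of locally trained parameters, in the style of federated averaging. The sketch below is a generic illustration rather than a federated safe RL algorithm from the literature; the function name, the flat parameter vectors, and the per-agent sample counts are assumptions made for the example.

```python
from typing import List, Sequence


def federated_average(local_params: Sequence[Sequence[float]],
                      sample_counts: Sequence[int]) -> List[float]:
    """Sample-weighted average of local parameter vectors."""
    if len(local_params) != len(sample_counts) or not local_params:
        raise ValueError("need one sample count per parameter vector")
    total = sum(sample_counts)
    dim = len(local_params[0])
    averaged = [0.0] * dim
    for params, count in zip(local_params, sample_counts):
        for j in range(dim):
            # Each agent contributes in proportion to the data it trained on.
            averaged[j] += params[j] * count / total
    return averaged


if __name__ == "__main__":
    # Two hypothetical agents share policy parameters, never their raw trajectories.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], sample_counts=[1, 3]))
```

Only parameters leave each agent, which is what separates this scheme from the centralized data or gradient collection discussed above. Whether the averaged policy still satisfies each agent's safety constraints is not guaranteed by the averaging itself.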

V-B4 Advancing Convex Insights

Convex optimization is extensively explored for its ability to provide analytical convergence and optimality guarantees, which in turn yield more stable policies. In the context of safe DRL with convex or non-convex constraints, integrating convex insights can enhance these convergence guarantees. To bring these insights into safe DRL, one option is to explore the application of ICNNs. Rather than training a conventional policy that takes data as input and outputs control actions, which must adhere to stringent physical constraints, ICNNs offer a promising alternative due to their superior generalization capabilities. This approach bridges the gap between model accuracy and control tractability by constructing networks that are convex with respect to their inputs, as detailed by [159] and further applied by [160] to model complex physical systems accurately. Consequently, training an ICNN-based policy can more easily incorporate convex constraints to ensure feasible and safe optimal control actions with performance guarantees.
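
The structural idea behind an input convex network can be sketched in a few lines: hidden activations pass through a convex, nondecreasing function (here ReLU), the weights applied to hidden activations are kept nonnegative, and a direct connection from the input may carry weights of any sign. The weights below are fixed, hypothetical values chosen only to demonstrate the structure; they are not trained and do not come from [159] or [160].

```python
from typing import List


def relu(v: List[float]) -> List[float]:
    """Elementwise ReLU, a convex and nondecreasing activation."""
    return [max(0.0, x) for x in v]


def affine(weights: List[List[float]], x: List[float], bias: List[float]) -> List[float]:
    """Compute W x + b for a small dense layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def icnn_forward(x: List[float]) -> float:
    """Two-layer input convex network with fixed, hypothetical weights."""
    # First layer: unconstrained weights acting on the input.
    w0 = [[1.0, -2.0], [-0.5, 1.5], [2.0, 0.5]]
    b0 = [0.0, 0.1, -0.3]
    z1 = relu(affine(w0, x, b0))
    # Output layer: weights on z1 must be nonnegative to preserve convexity;
    # the direct passthrough weights on x may have any sign.
    wz = [0.7, 0.2, 1.1]
    wx = [-0.4, 0.3]
    assert all(w >= 0.0 for w in wz)
    return sum(w * z for w, z in zip(wz, z1)) + sum(w * xi for w, xi in zip(wx, x)) + 0.05


if __name__ == "__main__":
    a, b = [-1.0, 2.0], [3.0, 1.0]
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    # Midpoint convexity: f((a+b)/2) <= (f(a)+f(b))/2.
    print(icnn_forward(mid) <= (icnn_forward(a) + icnn_forward(b)) / 2)
```

Because the output is a nonnegative combination of convex functions plus an affine term, it is convex in the input by construction, so minimizing it over a convex feasible set of actions is a convex problem.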

Additionally, using convex functions to approximate the policy function represents another viable strategy. Here, policy optimization can be formulated as a constrained optimization problem in which both the objective and the constraints are initially nonconvex. A series of surrogate convex-constrained optimization problems is then created by locally replacing the nonconvex functions with convex quadratic functions derived from policy gradient estimators, as described by [161], which allows theoretical insights to be applied in practice to operational policies. These strategies underscore the potential of convex optimization techniques in enhancing the robustness and effectiveness of safe DRL algorithms, particularly in applications that demand adherence to strict safety and physical constraints.

V-B5 Developing LLM-in-the-loop RL

Numerous practical objectives and constraints of power systems, such as those outlined in security guidelines and operation manuals, are based on linguistic stipulations and are difficult to model. In actual power system operations, when these constraints are violated, system operators typically need to take corrective actions [81]. Therefore, a human-in-the-loop approach has been proposed, where humans are integrated into the RL iteration process. This involvement allows humans to actively participate in constraint management, thereby enhancing the reliability of RL [162, 163]. Nonetheless, human-in-the-loop is limited by the availability and time constraints of human experts, making it infeasible for tasks that require extensive amounts of training data or continuous adaptation.

With the advent of LLMs, the possibility of transitioning from human-in-the-loop to LLM-in-the-loop systems emerges as a viable alternative to address the aforementioned challenges. LLMs, with their powerful learning capabilities and vast knowledge based on power system data and linguistic stipulations, can provide consistent, real-time, and potentially unbiased feedback compared to human experts [164]. For example, [81] integrates the GPT LLM into the OPF framework with linguistic rules. This model quantifies natural language stipulations as objectives and constraints within the power system optimization problem for the first time. In the future, leveraging specialized knowledge in the power system domain to train dedicated LLMs will be crucial for extending their application across a broader spectrum of the power system industry. However, challenges remain in how LLMs can efficiently learn from power system knowledge bases, integrate with existing software tools, quantify uncertainties, and ensure the safety of constraints [164].

VI Conclusion

This paper represents the first comprehensive review of the application of safe RL in power systems, addressing pivotal operational tasks including optimal power generation dispatch, voltage control, stability control, EV charging control, electricity markets, service restoration, and unit commitment. In its first part, the paper introduces the foundational concepts of safe RL, including constraint classifications, existing algorithms, benchmarks, and the unique features and limitations of each algorithm. Subsequently, the paper provides a detailed overview of almost all existing studies on safe RL applications within power systems to date. It categorizes these studies according to their application domains, methodically enumerating each paper’s objectives, constraints, implemented safe RL techniques, environment types, and key features. This review establishes a foundation for the advancement of safe RL applications in power systems, providing direction for future research endeavors.

References

[1] X. Chen, G. Qu, Y. Tang, S. Low, and N. Li, “Reinforcement learning for selective key applications in power systems: Recent advances and future challenges,” IEEE Trans. Smart Grid, vol. 13, no. 4, pp. 2935–2958, Jul. 2022.
[2] L. Vu, T. Vu, T. L. Vu, and A. Srivastava, “Multi-agent deep reinforcement learning for distributed load restoration,” IEEE Trans. Smart Grid, 2023.
[3] J. Zhao, F. Li, S. Mukherjee, and C. Sticht, “Deep reinforcement learning-based model-free on-line dynamic multi-microgrid formation to enhance resilience,” IEEE Trans. Smart Grid, vol. 13, no. 4, pp. 2557–2567, Jul. 2022.
[4] S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, Y. Yang, and A. Knoll, “A review of safe reinforcement learning: Methods, theory and applications,” arXiv preprint arXiv:2205.10330, 2022.
[5] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437–1480, 2015.
[6] J. Li, X. Wang, S. Chen, and D. Yan, “Research and application of safe reinforcement learning in power system,” in Proc. Asia Conf. Power Electr. Eng., 2023, pp. 1977–1982.
[7] W. Zhao, T. He, R. Chen, T. Wei, and C. Liu, “State-wise safe reinforcement learning: A survey,” in Proc. Int. Joint Conf. Artif. Intell., 2023.
[8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. Cambridge, MA, USA: MIT Press, 2018.
[9] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proc. Int. Conf. Mach. Learn., vol. 70, no. 10, Aug. 2017, pp. 22–31.
[10] H. Li and H. He, “Learning to operate distribution networks with safe deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 13, no. 3, pp. 1860–1872, May 2022.
[11] Q. Zhang, K. Dehghanpour, Z. Wang, F. Qiu, and D. Zhao, “Multi-agent safe policy learning for power management of networked microgrids,” IEEE Trans. Smart Grid, vol. 12, no. 2, pp. 1048–1062, Mar. 2020.
[12] Y. Ye, H. Wang, P. Chen, Y. Tang, and G. Strbac, “Safe deep reinforcement learning for microgrid energy management in distribution networks with leveraged spatial-temporal perception,” IEEE Trans. Smart Grid, vol. 14, no. 5, pp. 3759–3775, Sep. 2023.
[13] Y. Liu, A. Halev, and X. Liu, “Policy learning with constraints in model-free reinforcement learning: A survey,” in Proc. Int. Joint Conf. Artif. Intell., Aug. 2021, pp. 1–8.
[14] Z. Yi, Y. Xu, and C. Wu, “Model-free economic dispatch for virtual power plants: An adversarial safe reinforcement learning approach,” IEEE Trans. Power Syst., 2023.
[15] W. Liu, P. Zhuang, H. Liang, J. Peng, and Z. Huang, “Distributed economic dispatch in microgrids based on cooperative reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2192–2203, Jun. 2018.
[16] H. Liu and W. Wu, “Two-stage deep reinforcement learning for inverter-based volt-var control in active distribution networks,” IEEE Trans. Smart Grid, vol. 12, no. 3, pp. 2037–2047, 2020.
[17] R. Yan, Q. Xing, and Y. Xu, “Multi-agent safe graph reinforcement learning for PV inverters-based real-time decentralized volt/var control in zoned distribution networks,” IEEE Trans. Smart Grid, Jan. 2023.
[18] E. Altman, Constrained Markov decision processes. London, U.K.: Chapman and Hall, Mar. 1999.
[19] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” J. Mach. Learn. Res., vol. 18, no. 167, pp. 1–51, 2018.
[20] D. Bertsekas, Convex optimization algorithms. Athena Scientific, 2015.
[21] A. Ray, J. Achiam, and D. Amodei, “Benchmarking safe exploration in deep reinforcement learning,” arXiv preprint arXiv:1910.01708, 2019.
[22] S. Gu, J. G. Kuba, M. Wen, R. Chen, Z. Wang, Z. Tian, J. Wang, A. Knoll, and Y. Yang, “Multi-agent constrained policy optimisation,” arXiv preprint arXiv:2110.02793, 2021.
[23] A. Stooke, J. Achiam, and P. Abbeel, “Responsive safety in reinforcement learning by PID Lagrangian methods,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 9133–9143.
[24] T. Wu, A. Scaglione, and D. Arnold, “Constrained reinforcement learning for predictive control in real-time stochastic dynamic optimal power flow,” IEEE Trans. Power Syst., 2023.
[25] W. Wang, N. Yu, Y. Gao, and J. Shi, “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Trans. Smart Grid, vol. 11, no. 4, pp. 3008–3018, Jul. 2019.
[26] T. Wu, A. Scaglione, A. P. Surani, D. Arnold, and S. Peisert, “Network-constrained reinforcement learning for optimal ev charging control,” in Proc. IEEE Int. Conf. Smart Grid Commun., 2023, pp. 1–6.
[27] Z. Yan and Y. Xu, “A hybrid data-driven method for fast solution of security-constrained optimal power flow,” IEEE Trans. Power Syst., vol. 37, no. 6, pp. 4365–4374, Nov. 2022.
[28] A. R. Sayed, X. Zhang, Y. Wang, G. Wang, J. Qiu, and C. Wang, “Online operational decision-making for integrated electric-gas systems with safe reinforcement learning,” IEEE Trans. Power Syst., 2023.
[29] D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained markov decision processes,” Proc. Adv. Neural Inf. Process. Syst., vol. 33, pp. 8378–8390, 2020.
[30] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, “Projection-based constrained policy optimization,” in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–24.
[31] Y. Zhang, Q. Vuong, and K. Ross, “First order constrained optimization in policy space,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 15 338–15 349.
[32] L. Yang, J. Ji, J. Dai, Y. Zhang, P. Li, and G. Pan, “Cup: A conservative update policy algorithm for safe reinforcement learning,” arXiv preprint arXiv:2202.07565, 2022.
[33] M. Zhang, G. Guo, S. Magnússon, R. C. Pilawa-Podgurski, and Q. Xu, “Data driven decentralized control of inverter based renewable energy sources using safe guaranteed multi-agent deep reinforcement learning,” IEEE Trans. Sustain. Energy, 2023.
[34] Y. Jiang, Q. Ye, B. Sun, Y. Wu, and D. H. Tsang, “Data-driven coordinated charging for electric vehicles with continuous charging rates: A deep policy gradient approach,” IEEE Internet Things J., vol. 9, no. 14, pp. 12 395–12 412, Jul. 2021.
[35] W. Wang, N. Yu, J. Shi, and Y. Gao, “Volt-VAR control in power distribution systems with deep reinforcement learning,” in Proc. IEEE Int. Conf. Commun. Control Comput. Technol. Smart Grids, Oct. 2019, pp. 1–7.
[36] X. Wang, R. Wang, and Y. Cheng, “Safe reinforcement learning: A survey,” Acta Automatica Sinica, vol. 49, pp. 1–23, 2023.
[37] R. Sepulchre, M. Jankovic, and P. V. Kokotovic, Constructive nonlinear control. Springer Science & Business Media, 2012.
[38] T. J. Perkins and A. G. Barto, “Lyapunov design for safe reinforcement learning,” J. Mach. Learn. Res., vol. 3, pp. 803–832, Dec 2002.
[39] W. Cui, Y. Jiang, and B. Zhang, “Reinforcement learning for optimal primary frequency control: A lyapunov approach,” IEEE Trans. Power Syst., vol. 38, no. 2, pp. 1676–1688, 2022.
[40] W. Cui, J. Li, and B. Zhang, “Decentralized safe reinforcement learning for inverter-based voltage control,” Electric Power Syst. Res., vol. 211, 2022, Art. no. 108609.
[41] Y. Shi, G. Qu, S. Low, A. Anandkumar, and A. Wierman, “Stability constrained reinforcement learning for real-time voltage control,” in Proc. Amer. Control Conf., 2022, pp. 2715–2721.
[42] C. K. Williams and C. E. Rasmussen, Gaussian processes for machine learning. Cambridge, MA, USA: MIT Press, 2006, vol. 2, no. 3.
[43] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with gaussian processes,” in Proc. IEEE Conf. Decis. Control, Dec. 2014, pp. 1424–1431.
[44] Y. Sui, A. Gotovos, J. Burdick, and A. Krause, “Safe exploration for optimization with gaussian processes,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 997–1005.
[45] A. I. Cowen-Rivers, D. Palenicek, V. Moens, M. A. Abdullah, A. Sootla, J. Wang, and H. Bou-Ammar, “Samba: Safe model-based & active reinforcement learning,” Mach. Learn., vol. 111, no. 1, pp. 173–203, 2022.
[46] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, “Safe reinforcement learning via shielding,” in Proc. AAAI Conf. Artif. Intell., vol. 32, no. 1, Apr. 2018, pp. 2661–2669.
[47] P. Chen, S. Liu, X. Wang, and I. Kamwa, “Physics-shielded multi-agent deep reinforcement learning for safe active voltage control with photovoltaic/battery energy storage systems,” IEEE Trans. Smart Grid, Jul. 2022.
[48] Q. Zhang, M. H. B. Mahbod, C.-B. Chng, P.-S. Lee, and C.-K. Chui, “Residual physics and post-posed shielding for safe deep reinforcement learning method,” IEEE Trans. Cybern., 2022.
[49] A. Ajagekar and F. You, “Deep reinforcement learning based unit commitment scheduling under load and wind power uncertainty,” IEEE Trans. Sustain. Energy, vol. 14, no. 2, pp. 803–812, Apr. 2023.
[50] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv preprint arXiv:1801.08757, 2018.
[51] Z. Yi, X. Wang, C. Yang, C. Yang, M. Niu, and W. Yin, “Real-time sequential security-constrained optimal power flow: A hybrid knowledge-data-driven reinforcement learning approach,” IEEE Trans. Power Syst., vol. 39, no. 1, pp. 1664–1680, Jan. 2024.
[52] Y. Gao and N. Yu, “Model-augmented safe reinforcement learning for volt-var control in power distribution networks,” Appl. Energy, vol. 313, 2022, Art. no. 118762.
[53] Y. Du and D. Wu, “Deep reinforcement learning from demonstrations to assist service restoration in islanded microgrids,” IEEE Trans. Sustain. Energy, vol. 13, no. 2, pp. 1062–1072, Apr. 2022.
[54] Y. Wang, S. S. Zhan, R. Jiao, Z. Wang, W. Jin, Z. Yang, Z. Wang, C. Huang, and Q. Zhu, “Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 36 593–36 604.
[55] Y. Liu, J. Ding, and X. Liu, “IPO: Interior-point policy optimization under constraints,” in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 04, 2020, pp. 4940–4947.
[56] H. Cui, Y. Ye, J. Hu, Y. Tang, Z. Lin, and G. Strbac, “Online preventive control for transmission overload relief using safe reinforcement learning with enhanced spatial-temporal awareness,” IEEE Trans. Power Syst., vol. 39, no. 1, pp. 517–532, Jan. 2024.
[57] T. L. Vu, S. Mukherjee, R. Huang, and Q. Huang, “Barrier function-based safe reinforcement learning for emergency control of power systems,” in Proc. IEEE Conf. Decis. Control, 2021, pp. 3652–3657.
[58] M. Zanon and S. Gros, “Safe reinforcement learning using robust mpc,” IEEE Trans. Autom. Control, vol. 66, no. 8, pp. 3638–3652, 2020.
[59] Y. Li, N. Li, H. E. Tseng, A. Girard, D. Filev, and I. Kolmanovsky, “Safe reinforcement learning using robust action governor,” in Proc. Learn. Dyn. Control, 2021, pp. 1093–1104.
[60] A. B. Kordabad, R. Wisniewski, and S. Gros, “Safe reinforcement learning using wasserstein distributionally robust mpc and chance constraint,” IEEE Access, vol. 10, pp. 130 058–130 067, 2022.
[61] S. Pfrommer, T. Gautam, A. Zhou, and S. Sojoudi, “Safe reinforcement learning with chance-constrained model predictive control,” in Proc. Learn. Dyn. Control, 2022, pp. 291–303.
[62] J. Coulson, J. Lygeros, and F. Dörfler, “Distributionally robust chance constrained data-enabled predictive control,” IEEE Trans. Autom. Control, vol. 67, no. 7, pp. 3289–3304, 2021.
[63] J. Yu, C. Gehring, F. Schäfer, and A. Anandkumar, “Robust reinforcement learning: A constrained game-theoretic approach,” in Proc. Learn. Dyn. Control, 2021, pp. 1242–1254.
[64] A. Rajeswaran, I. Mordatch, and V. Kumar, “A game theoretic framework for model based reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 7953–7963.
[65] A. Asheralieva and D. Niyato, “Hierarchical game-theoretic and reinforcement learning framework for computational offloading in UAV-enabled mobile edge computing networks with multiple service providers,” IEEE Internet Things J., vol. 6, no. 5, pp. 8753–8769, 2019.
[66] C. Tessler, Y. Efroni, and S. Mannor, “Action robust reinforcement learning and applications in continuous control,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 6215–6224.
[67] A. Ray, J. Achiam, and D. Amodei, “Safety-gym: Tools for accelerating safe exploration research,” [Online]. Available: https://github.com/openai/safety-gym, accessed: Feb. 04, 2024.
[68] ——, “Safety-starter-agents: Basic constrained RL agents,” [Online]. Available: https://github.com/openai/safety-starter-agents, accessed: Feb. 04, 2024.
[69] J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y. Geng, Y. Zhong, J. Dai, and Y. Yang, “Safety-gymnasium: A unified safe reinforcement learning benchmark,” arXiv preprint arXiv:2310.12567, 2023.
[70] ——, “Safety-gymnasium: A unified safe reinforcement learning benchmark,” [Online]. Available: https://github.com/PKU-Alignment/safety-gymnasium, accessed: Feb. 04, 2024.
[71] ——, “Safe policy optimization: A benchmark repository for safe reinforcement learning algorithms,” [Online]. Available: https://github.com/PKU-Alignment/Safe-Policy-Optimization, accessed: Feb. 04, 2024.
[72] J. Ji, J. Zhou, B. Zhang, J. Dai, X. Pan, R. Sun, W. Huang, Y. Geng, M. Liu, and Y. Yang, “Omnisafe: An infrastructure for accelerating safe reinforcement learning research,” arXiv preprint arXiv:2305.09304, 2023.
[73] ——, “Omnisafe: An infrastructural framework for accelerating safe RL research,” [Online]. Available: https://github.com/PKU-Alignment/omnisafe, accessed: Feb. 04, 2024.
[74] M. Eichelbeck, H. Markgraf, and M. Althoff, “Contingency-constrained economic dispatch with safe reinforcement learning,” in Proc. IEEE Int. Conf. Mach. Learn. Appl., 2022, pp. 597–602.
[75] Y. Ding, X. Chen, and J. Wang, “Deep reinforcement learning-based method for joint optimization of mobile energy storage systems and power grid with high renewable energy sources,” Batteries, vol. 9, no. 4, p. 219, 2023.
[76] A. R. Sayed, C. Wang, H. Anis, and T. Bi, “Feasibility constrained online calculation for real-time optimal power flow: A convex constrained deep reinforcement learning approach,” IEEE Trans. Power Syst., vol. 38, no. 6, pp. 5215–5227, Nov. 2023.
[77] Y. Chen, Q. Du, H. Liu, L. Cheng, and M. S. Younis, “Improved proximal policy optimization algorithm for sequential security-constrained optimal power flow based on expert knowledge and safety layer,” J. Modern Power Syst. Clean Energy, 2023.
[78] J. Zhang, L. Sang, Y. Xu, and H. Sun, “Networked multiagent-based safe reinforcement learning for low-carbon demand management in distribution networks,” IEEE Trans. Sustain. Energy, 2024.
[79] H. Li, Z. Wang, L. Li, and H. He, “Online microgrid energy management based on safe deep reinforcement learning,” in IEEE Symp. Ser. Comput. Intell., 2021, pp. 1–8.
[80] H. Shengren, P. P. Vergara, E. M. S. Duque, and P. Palensky, “Optimal energy system scheduling using a constraint-aware reinforcement learning algorithm,” Int. J. Electr. Power Energy Syst., vol. 152, 2023, Art. no. 109230.
[81] Z. Yan and Y. Xu, “Real-time optimal power flow with linguistic stipulations: Integrating GPT-agent and deep reinforcement learning,” IEEE Trans. Power Syst., 2023.
[82] ——, “Real-time optimal power flow: A Lagrangian based deep reinforcement learning approach,” IEEE Trans. Power Syst., vol. 35, no. 4, pp. 3270–3273, Jul. 2020.
[83] S.-H. Hong and H.-S. Lee, “Robust energy management system with safe reinforcement learning using short-horizon forecasts,” IEEE Trans. Smart Grid, vol. 14, no. 3, pp. 2485–2488, May 2023.
[84] G. Ceusters, L. R. Camargo, R. Franke, A. Nowé, and M. Messagie, “Safe reinforcement learning for multi-energy management systems with known constraint functions,” Energy AI, vol. 12, 2023, Art. no. 100227.
[85] Y. Wang, D. Qiu, M. Sun, G. Strbac, and Z. Gao, “Secure energy management of multi-energy microgrid: A physical-informed safe reinforcement learning approach,” Appl. Energy, vol. 335, Apr. 2023, Art. no. 120759.
[86] B. Kocuk, S. S. Dey, and X. A. Sun, “Strong SOCP relaxations for the optimal power flow problem,” Oper. Res., vol. 64, no. 6, pp. 1177–1196, 2016.
[87] A. Marano-Marcolini, F. Capitanescu, J. L. Martinez-Ramos, and L. Wehenkel, “Exploiting the use of DC SCOPF approximation to improve iterative AC SCOPF algorithms,” IEEE Trans. Power Syst., vol. 27, no. 3, pp. 1459–1466, Aug. 2012.
[88] M. Yan, M. Shahidehpour, A. Paaso, L. Zhang, A. Alabdulwahab, and A. Abusorrah, “A convex three-stage SCOPF approach to power system flexibility with unified power flow controllers,” IEEE Trans. Power Syst., vol. 36, no. 3, pp. 1947–1960, May 2021.
[89] T. Su, J. Zhao, X. Chen, and X. Liu, “Analytic input convex neural networks-based model predictive control for power system transient stability enhancement,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2023, pp. 1–5.
[90] T. Wu, A. Scaglione, and D. Arnold, “Constrained reinforcement learning for stochastic dynamic optimal power flow control,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2023, pp. 1–5.
[91] M. Zhang, G. Guo, T. Zhao, and Q. Xu, “DNN assisted projection based deep reinforcement learning for safe control of distribution grids,” IEEE Trans. Power Syst., 2023.
[92] H. Liu and W. Wu, “Online multi-agent reinforcement learning for decentralized inverter-based volt-var control,” IEEE Trans. Smart Grid, vol. 12, no. 4, pp. 2980–2990, Jul. 2021.
[93] P. Kou, D. Liang, C. Wang, Z. Wu, and L. Gao, “Safe deep reinforcement learning-based constrained optimal control scheme for active distribution networks,” Appl. Energy, vol. 264, 2020, Art. no. 114772.
[94] G. Guo, M. Zhang, Y. Gong, and Q. Xu, “Safe multi-agent deep reinforcement learning for real-time decentralized control of inverter based renewable energy resources considering communication delay,” Appl. Energy, vol. 349, 2023, Art. no. 121648.
[95] Y. Chen, Y. Shi, D. Arnold, and S. Peisert, “Saver: Safe learning-based controller for real-time voltage regulation,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2022, pp. 1–5.
[96] H. T. Nguyen and D.-H. Choi, “Three-stage inverter-based peak shaving and volt-var control in active distribution networks using online safe deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 13, no. 4, pp. 3266–3277, Jul. 2022.
[97] I. L. Carreño, A. Scaglione, D. Arnold, and T. Wu, “Voltage security region of a three-phase unbalanced distribution power system with dynamics,” IEEE Trans. Power Syst., 2024.
[98] T. Wu, A. Scaglione, and D. Arnold, “Reinforcement learning using physics inspired graph convolutional neural networks,” in Proc. 58th Annu. Allerton Conf. Commun., Control, Comput., 2022, pp. 1–8.
[99] C. Roberts, S.-T. Ngo, A. Milesi, A. Scaglione, S. Peisert, and D. Arnold, “Deep reinforcement learning for mitigating cyber-physical DER voltage unbalance attacks,” in Proc. Amer. Control Conf., 2021, pp. 2861–2867.
[100] A. F. Bastos, S. Santoso, V. Krishnan, and Y. Zhang, “Machine learning-based prediction of distribution network voltage and sensor allocation,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2020, pp. 1–5.
[101] Q. Hou, E. Du, N. Zhang, and C. Kang, “Impact of high renewable penetration on the power system operation mode: A data-driven approach,” IEEE Trans. Power Syst., vol. 35, no. 1, pp. 731–741, Jan. 2020.
[102] H. Ruan, H. Gao, Y. Liu, L. Wang, and J. Liu, “Distributed voltage control in active distribution network considering renewable energy: A novel network partitioning method,” IEEE Trans. Power Syst., vol. 35, no. 6, pp. 4220–4231, Nov. 2020.
[103] H. Zhang, X. Sun, M. H. Lee, and J. Moon, “Deep reinforcement learning based active network management and emergency load-shedding control for power systems,” IEEE Trans. Smart Grid, 2023.
[104] J. Feng, W. Cui, J. Cortés, and Y. Shi, “Bridging transient and steady-state performance in voltage control: A reinforcement learning approach with safe gradient flow,” IEEE Control Syst. Lett., 2023.
[105] Y. Xia, Y. Xu, Y. Wang, S. Mondal, S. Dasgupta, A. K. Gupta, and G. M. Gupta, “A safe policy learning-based method for decentralized and economic frequency control in isolated networked-microgrid systems,” IEEE Trans. Sustain. Energy, vol. 13, no. 4, pp. 1982–1993, Oct. 2022.
[106] X. Wan, M. Sun, B. Chen, Z. Chu, and F. Teng, “Adapsafe: Adaptive and safe-certified deep reinforcement learning-based frequency control for carbon-neutral power systems,” in Proc. AAAI Conf. Artif. Intell., 2023.
[107] D. Tabas and B. Zhang, “Computationally efficient safe reinforcement learning for power systems,” in Proc. Amer. Control Conf., 2022, pp. 3303–3310.
[108] Y. Zhou, L. Zhou, D. Shi, and X. Zhao, “Coordinated frequency control through safe reinforcement learning,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2022, pp. 1–5.
[109] P. Gupta, A. Pal, and V. Vittal, “Coordinated wide-area damping control using deep neural networks and reinforcement learning,” IEEE Trans. Power Syst., vol. 37, no. 1, pp. 365–376, Jan. 2022.
[110] K.-b. Kwon, S. Mukherjee, T. L. Vu, and H. Zhu, “Risk-constrained reinforcement learning for inverter-dominated power system controls,” IEEE Control Syst. Lett., vol. 7, pp. 3854–3859, 2023.
[111] M. Tarle, M. Larsson, G. Ingeström, L. Nordström, and M. Björkman, “Safe reinforcement learning for mitigation of model errors in FACTS setpoint control,” in Proc. Int. Conf. Smart Energy Syst. Technol., 2023, pp. 1–6.
[112] L. Wehenkel and M. Pavella, “Preventive vs. emergency control of power systems,” in Proc. IEEE PES Power Syst. Conf. Expo., 2004, pp. 1665–1670.
[113] P. Kundur et al., “Definition and classification of power system stability IEEE/CIGRE joint task force on stability terms and definitions,” IEEE Trans. Power Syst., vol. 19, no. 3, pp. 1387–1401, Aug. 2004.
[114] N. Hatziargyriou, J. Milanovic, C. Rahmann, V. Ajjarapu, C. Canizares, I. Erlich, D. Hill, I. Hiskens, I. Kamwa, B. Pal et al., “Definition and classification of power system stability–revisited & extended,” IEEE Trans. Power Syst., vol. 36, no. 4, pp. 3271–3281, Jul. 2021.
[115] G. Chen and X. Shi, “A deep reinforcement learning-based charging scheduling approach with augmented Lagrangian for electric vehicle,” arXiv preprint arXiv:2209.09772, 2022.
[116] H. Zhang, J. Peng, H. Tan, H. Dong, and F. Ding, “A deep reinforcement learning-based energy management framework with Lagrangian relaxation for plug-in hybrid electric vehicle,” IEEE Trans. Transport. Electrific., vol. 7, no. 3, pp. 1146–1160, Sep. 2020.
[117] S. Zhang, R. Jia, H. Pan, and Y. Cao, “A safe reinforcement learning-based charging strategy for electric vehicles in residential microgrid,” Appl. Energy, vol. 348, Oct. 2023, Art. no. 121490.
[118] H. Li, Z. Wan, and H. He, “Constrained EV charging scheduling based on safe deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 11, no. 3, pp. 2427–2439, May 2020.
[119] R. Liessner, A. M. Dietermann, and B. Bäker, “Safe deep reinforcement learning hybrid electric vehicle energy management,” in Proc. Int. Conf. Agents Artif. Intell., 2019, pp. 161–181.
[120] International Energy Agency, “Global EV outlook 2023: Catching up with climate ambitions,” 2023. [Online]. Available: https://www.iea.org/reports/global-ev-outlook-2023
[121] F. Rassaei, W.-S. Soh, and K.-C. Chua, “Demand response for residential electric vehicles with random usage patterns in smart grids,” IEEE Trans. Sustain. Energy, vol. 6, no. 4, pp. 1367–1376, Oct. 2015.
[122] D. V. Le, R. Wang, Y. Liu, R. Tan, Y.-W. Wong, and Y. Wen, “Deep reinforcement learning for tropical air free-cooled data center control,” ACM Trans. Sensor Netw., vol. 17, no. 3, pp. 1–28, 2021.
[123] Q. Zhang, C.-B. Chng, K. Chen, P.-S. Lee, and C.-K. Chui, “DRL-S: Toward safe real-world learning of dynamic thermal management in data center,” Expert Syst. Appl., vol. 214, 2023, Art. no. 119146.
[124] H. Ding, Y. Xu, B. C. S. Hao, Q. Li, and A. Lentzakis, “A safe reinforcement learning approach for multi-energy management of smart home,” Electric Power Syst. Res., vol. 210, 2022, Art. no. 108120.
[125] P. Yu, H. Zhang, Y. Song, H. Hui, and G. Chen, “District cooling system control for providing operating reserve based on safe deep reinforcement learning,” IEEE Trans. Power Syst., vol. 39, no. 1, pp. 40–52, 2024.
[126] X. Lin, D. Yuan, and X. Li, “Reinforcement learning with dual safety policies for energy savings in building energy systems,” Buildings, vol. 13, no. 3, p. 580, 2023.
[127] C. Zhang, S. R. Kuppannagari, and V. K. Prasanna, “Safe building HVAC control via batch reinforcement learning,” IEEE Trans. Sustain. Comput., vol. 7, no. 4, pp. 923–934, 2022.
[128] Z. Liang, C. Huang, W. Su, N. Duan, V. Donde, B. Wang, and X. Zhao, “Safe reinforcement learning-based resilient proactive scheduling for a commercial building considering correlated demand response,” IEEE Open Access J. Power Energy, vol. 8, pp. 85–96, 2021.
[129] D. Qiu, Z. Dong, X. Zhang, Y. Wang, and G. Strbac, “Safe reinforcement learning for real-time automatic control in a smart energy-hub,” Appl. Energy, vol. 309, 2022, Art. no. 118403.
[130] A. D. Garmroodi, F. Nasiri, and F. Haghighat, “Optimal dispatch of an energy hub with compressed air energy storage: A safe reinforcement learning approach,” J. Energy Storage, vol. 57, 2023, Art. no. 106147.
[131] I. Hamilton, H. Kennard, J. Amorocho, S. Steuwer, J. Kockat, Z. Toth, C. Delmastro, R. M. Gordon, and K. Petrichenko, “Global status report for buildings and construction,” UN Environment Programme, Tech. Rep., 2024.
[132] H.-Y. Liu, B. Balaji, S. Gao, R. Gupta, and D. Hong, “Safe HVAC control via batch reinforcement learning,” in Proc. ACM/IEEE Int. Conf. Cyber-Phys. Syst., 2022, pp. 181–192.
[133] X. Shi, Y. Xu, G. Chen, and Y. Guo, “An augmented Lagrangian-based safe reinforcement learning algorithm for carbon-oriented optimal scheduling of EV aggregators,” IEEE Trans. Smart Grid, vol. 15, no. 1, pp. 795–809, Jan. 2024.
[134] T. Zhu, X. Zhang, J. Duan, Z. Zhou, and X. Chen, “A budget-aware incentive mechanism for vehicle-to-grid via reinforcement learning,” in Proc. IEEE Int. Symp. Qual. Service, 2023, pp. 1–10.
[135] H. Yang, Y. Xu, and Q. Guo, “Dynamic incentive pricing on charging stations for real-time congestion management in distribution network: An adaptive model-based safe deep reinforcement learning method,” IEEE Trans. Sustain. Energy, 2023.
[136] X. Zhang, B. Knueven, A. Zamzam, M. Reynolds, and W. Jones, “Primal-dual differentiable programming for distribution system critical load restoration,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2023, pp. 1–5.
[137] X. Li, X. Han, and M. Yang, “Risk-based reserve scheduling for active distribution networks based on an improved proximal policy optimization algorithm,” IEEE Access, vol. 11, pp. 15 211–15 228, 2022.
[138] Z. Zhang and M. Wu, “Predicting real-time locational marginal prices: A GAN-based approach,” IEEE Trans. Power Syst., vol. 37, no. 2, pp. 1286–1296, 2021.
[139] Z. Ni and S. Paul, “A multistage game in smart grid security: A reinforcement learning solution,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2684–2695, 2019.
[140] Y. Guo, L. Wang, Z. Liu, and Y. Shen, “Reinforcement-learning-based dynamic defense strategy of multistage game against dynamic load altering attack,” Int. J. Electr. Power Energy Syst., vol. 131, 2021, Art. no. 107113.
[141] V.-H. Bui, A. Hussain, and W. Su, “A dynamic internal trading price strategy for networked microgrids: A deep reinforcement learning-based game-theoretic approach,” IEEE Trans. Smart Grid, vol. 13, no. 5, pp. 3408–3421, 2022.
[142] A.-P. Surani, T. Wu, and A. Scaglione, “Competitive reinforcement learning for real-time pricing and scheduling control in coupled EV charging stations and power networks,” in Proc. Int. Conf. Syst. Sci., 2024.
[143] B. Peng, J. Duan, J. Chen, S. E. Li, G. Xie, C. Zhang, Y. Guan, Y. Mu, and E. Sun, “Model-based chance-constrained reinforcement learning via separated proportional-integral Lagrangian,” IEEE Trans. Neural Netw. Learn. Syst., 2022.
[144] A. Hassan, R. Mieth, M. Chertkov, D. Deka, and Y. Dvorkin, “Optimal load ensemble control in chance-constrained optimal power flow,” IEEE Trans. Smart Grid, vol. 10, no. 5, pp. 5186–5195, 2018.
[145] O. Ciftci, M. Mehrtash, and A. Kargarian, “Data-driven nonparametric chance-constrained optimization for microgrid energy management,” IEEE Trans. Ind. Inform., vol. 16, no. 4, pp. 2447–2457, 2019.
[146] J. Liang, W. Jiang, C. Lu, and C. Wu, “Joint chance-constrained unit commitment: Statistically feasible robust optimization with learning-to-optimize acceleration,” IEEE Trans. Power Syst., 2024.
[147] S. Tang, M. Makar, M. Sjoding, F. Doshi-Velez, and J. Wiens, “Leveraging factored action spaces for efficient offline reinforcement learning in healthcare,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 34 272–34 286.
[148] K. Hreinsson, A. Scaglione, M. Alizadeh, and Y. Chen, “New insights from the Shapley-Folkman lemma on dispatchable demand in energy markets,” IEEE Trans. Power Syst., vol. 36, no. 5, pp. 4028–4041, 2021.
[149] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Comput. Surv., vol. 53, no. 2, pp. 1–33, 2020.
[150] H.-M. Chung, S. Maharjan, Y. Zhang, and F. Eliassen, “Distributed deep reinforcement learning for intelligent load scheduling in residential smart grids,” IEEE Trans. Ind. Inform., vol. 17, no. 4, pp. 2752–2763, 2020.
[151] X. Tang, J. Chen, T. Liu, Y. Qin, and D. Cao, “Distributed deep reinforcement learning-based energy and emission management strategy for hybrid electric vehicles,” IEEE Trans. Veh. Technol., vol. 70, no. 10, pp. 9922–9934, 2021.
[152] S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv:2005.01643, 2020.
[153] Z. Liu, Z. Guo, Y. Yao, Z. Cen, W. Yu, T. Zhang, and D. Zhao, “Constrained decision transformer for offline safe reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 21 611–21 630.
[154] B. Wang and N. Hegde, “Privacy-preserving Q-learning with functional noise in continuous spaces,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
[155] D. Qiao and Y.-X. Wang, “Offline reinforcement learning with differential privacy,” in Proc. Adv. Neural Inf. Process. Syst., vol. 36, 2024.
[156] V. Dvorkin, F. Fioretto, P. Van Hentenryck, P. Pinson, and J. Kazempour, “Differentially private optimal power flow for distribution grids,” IEEE Trans. Power Syst., vol. 36, no. 3, pp. 2186–2196, 2020.
[157] J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated reinforcement learning: Techniques, applications, and open challenges,” arXiv preprint arXiv:2108.11887, 2021.
[158] X. Fan, Y. Ma, Z. Dai, W. Jing, C. Tan, and B. K. H. Low, “Fault-tolerant federated reinforcement learning with theoretical guarantee,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1007–1021.
[159] B. Amos, L. Xu, and J. Z. Kolter, “Input convex neural networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 146–155.
[160] Y. Chen, Y. Shi, and B. Zhang, “Optimal control via neural networks: A convex approach,” in Proc. Int. Conf. Learn. Representations, 2018.
[161] M. Yu, Z. Yang, M. Kolar, and Z. Wang, “Convergent policy optimization for safe reinforcement learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
[162] X. Sun, Z. Xu, J. Qiu, H. Liu, H. Wu, and Y. Tao, “Optimal volt/var control for unbalanced distribution networks with human-in-the-loop deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 15, no. 3, pp. 2639–2651, 2024.
[163] L. Yang, Q. Sun, N. Zhang, and Z. Liu, “Optimal energy operation strategy for we-energy of energy internet based on hybrid reinforcement learning with human-in-the-loop,” IEEE Trans. Syst., Man, Cybern.: Syst., vol. 52, no. 1, pp. 32–42, 2022.
[164] S. Majumder, L. Dong, F. Doudi, Y. Cai, C. Tian, D. Kalathil, K. Ding, A. A. Thatte, N. Li, and L. Xie, “Exploring the capabilities and limitations of large language models in the electric energy sector,” Joule, 2024, to appear.