A Review of Safe Reinforcement Learning Methods for Modern Power Systems

Tong Su, Tong Wu, Junbo Zhao, Anna Scaglione, Le Xie
This work is supported by the U.S. Department of Energy Solar Energy Technologies Office under award 37770. Tong Su and Junbo Zhao are with the Department of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269, USA (e-mail: [email protected]; [email protected]). Tong Wu and Anna Scaglione are with the Department of Electrical and Computer Engineering, Cornell Tech, Cornell University, New York City, NY 10044, USA (e-mail: [email protected]; [email protected]). Le Xie is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA (e-mail: [email protected]).
Abstract

Due to the availability of more comprehensive measurement data in modern power systems, there has been significant interest in developing and applying reinforcement learning (RL) methods for operation and control. Conventional RL training is based on trial-and-error and reward feedback interaction with either a model-based simulated environment or a data-driven and model-free simulation environment. These methods often lead to the exploration of actions in unsafe regions of operation and, after training, the execution of unsafe actions when the RL policies are deployed in real power systems. A large body of literature has proposed safe RL strategies to prevent unsafe training policies. In power systems, safe RL represents a class of RL algorithms that can ensure or promote the safety of power system operations by executing safe actions while optimizing the objective function. While different papers handle the safety constraints differently, the overarching goal of safe RL methods is to determine how to train policies to satisfy safety constraints while maximizing rewards. This paper provides a comprehensive review of safe RL techniques and their applications in different power system operations and control, including optimal power generation dispatch, voltage control, stability control, electric vehicle (EV) charging control, buildings’ energy management, electricity market, system restoration, and unit commitment and reserve scheduling. Additionally, the paper discusses benchmarks, challenges, and future directions for safe RL research in power systems.

Index Terms:
Safe reinforcement learning, machine learning, power system operation, power system control, energy management, optimal power generation dispatch, EV charging, voltage control.

Nomenclature

Notations

$\gamma$

Discount factor $\gamma\in[0,1)$

$\Delta$

Difference operator

$\delta$

Rotor angle

$\epsilon/A$

Inertia parameter of temperature and thermal conductivity of HVAC

$\varepsilon$

Safety constraint bound

$\zeta$

Safety probability ($1-\zeta$ is the risk probability)

$\eta,\eta^{\text{CHP}}_{p/h}$

Efficiency of charging or discharging, electrical/thermal energy efficiency of CHP

$\theta$

Parameters of the policy $\pi_{\theta}$

$\vartheta$

Grid state in the DC-PF approximation

$\bm{\Lambda}^{\text{EV}}_{\text{ch/dis}}$

Charging/selling electricity price of EV

$\bm{\Lambda}^{\text{Ele/Gas/Car}}$

Price of electricity/gas/carbon

$\lambda$

Penalty coefficient or Lagrange multiplier

$\Pi_{S}$

Policy set

$\pi_{\theta}$, $\pi_{\theta}^{\text{adv}}$

Parameterized policy, policy of adversary

$\pi_{\theta}^{k}$, $\pi_{\theta}^{k+\frac{1}{2}}$

Policy at iteration $k$, intermediate policy between iterations $k$ and $k+1$

$\rho_{0}$

$\rho_{0}:\mathcal{S}\rightarrow[0,1]$ is the starting state distribution over $\mathcal{S}$

$\tau$

Trajectory $\tau=(s_{0},a_{0},s_{1},\ldots)$

$\bm{\omega}$

Frequency

$\mathcal{A},\bm{a}$

Action set, action

$a^{\text{SG}}/b^{\text{SG}}/c^{\text{SG}}$

Fuel cost coefficients of SG

$\mathcal{B}/\mathcal{G}/\mathcal{R}$

BESS/SG/RES set

$\mathcal{C},C$

Constraint set $\mathcal{C}=\{(C_{i},\varepsilon_{i})\}^{m}_{i=1}$, constraint cost function $C:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$

$c^{\text{RES/BESS}}$

Cost coefficients of RES/BESS

$\text{ch}/\text{dis}$

Charging/discharging of electrical or thermal energy for ESS

$\mathbb{D}$

Function to extract the vector of diagonal elements from a matrix

$M,L,\frac{1}{R},D$

Inertia constant, load damping coefficient, speed droop response coefficient, $D=\frac{1}{R}+L$ is the combined frequency response coefficient from synchronous generators and load

$\mathbb{E},E,E_{\text{cap}}$

Expectation function, energy associated with devices, energy capacity of ESS

$\mathcal{E}/\mathcal{N}$

Edge/node set

$f,g,h$

State transition dynamics or the model of the environment, equality constraints with a total number of $m$, inequality constraints with a total number of $n$.

$G/N$

Cardinality of the set $\mathcal{G}/\mathcal{N}$

$\bm{g}$

Gas input of CHP or GB

$\mathcal{H}/*$

Hermitian/conjugate for a vector or matrix

$\bm{h}$

Thermal energy generation or load vector

$\bm{i}$

Current phasor vector

$\mathcal{J}_{R}^{\pi_{\theta}}$, $\mathcal{J}_{h_{i}}^{\pi_{\theta}}$

Reward performance, constraint cost performance of inequality constraints

$\mathcal{L}$

Lagrangian

$\mathcal{M}$, $\mathcal{M}_{C}$

MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$, CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$

$\mathbb{P},\mathcal{P}$

Probability function, $\mathcal{P}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]$ is the transition matrix, where $\mathcal{P}(s_{t+1}|s_{t},a_{t})$ denotes the probability of state transition from $s_{t}$ to $s_{t+1}$ after taking action $a_{t}$

$P^{\text{Load}}_{\text{his/pre}}$

Historical/current net load forecast

$P_{\text{res}}$

Reserve requirement

$\bm{p}/\bm{q}$

Active/reactive power generation or load vector

$\overline{\bm{p}}^{\text{Gen}}_{e}$

Maximum emergency power generation of generator

$\bm{p}^{\text{Bus}}$

Bus power injection

$p_{ij}/q_{ij}/s_{ij}$

Active/reactive/apparent power for branch $ij$

$\bm{p}_{e}/\bm{p}_{m}$

Electrical/mechanical power

$R$

Reward function $R:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}$

$\bm{R}_{\text{up/down}}$

Ramp-up/down rate of generators

$r_{ij}/x_{ij}$

Resistance/reactance of line $ij$

$\mathcal{S},\bm{s}_{\text{ap}},\bm{s}$

State set, apparent power vector, state

$\bm{S}_{\text{up/down}}$

Start-up/shut-down rate of generators

$\mathcal{T},t$

Time step set of trajectory $\tau$, time instant

$\overline{t}_{\text{up}}/\underline{t}_{\text{up}},t_{\text{tot}}$

Maximum/minimum up time of Gens, total time

$T,H,T^{I/O}$

Temperature, humidity, indoor/outdoor temperature

$\bm{u}_{\text{start/shut/com}}$

Startup/shutdown/commitment status of Gens

$\bm{v}/\bm{\phi}$

Voltage phasor/phase vector $\bm{v}_{t}=|\bm{v}|\odot e^{\mathfrak{j}\bm{\phi}}$

$\mathbf{Y}/\mathbf{B}$

Admittance/susceptance matrix

$\overline{\ }/\underline{\ }$

Maximum/minimum values of the variable or vector

Abbreviations

AC/DC

Alternating current/direct current

ADN

Active Distribution Network

AMI

Advanced Metering Infrastructure

(B/M/T)ESS

(Battery/Mobile/Thermal) Energy Storage System

CHP

Combined Heat and Power system

(C)MDP

(Constrained) Markov Decision Process

CPO

Constrained Policy Optimization

CPPO

Constraint-controlled PPO

CS

Charging Station

CUP

Conservative Update Policy

DDPG

Deep Deterministic Policy Gradient

DG

Distributed Generation

DER

Distributed Energy Resource

(D/R)NN

(Deep/Recurrent) Neural Network

DSO

Distribution System Operator

(D/R)RL

(Deep/Robust) Reinforcement Learning

EHP

Electric Heat Pump

EV

Electric Vehicle

FACTS

Flexible AC Transmission System

FOCOPS

First Order Constrained Optimization in Policy Space

GCN

Graph Convolutional Network

GB

Gas Boiler

Gen

Generator

GP

Gaussian Process

GPT

Generative Pre-trained Transformer

HVAC

Heating, Ventilation and Air-Conditioning

ICNN

Input Convex Neural Network

IPO

Interior-point Policy Optimization

Lag

Lagrangian methods

LLM

Large Language Model

MA(C)

Multi-Agent (Constrained)

MIP

Mixed-Integer Programming

MPPT

Maximum Power Point Tracking

PCPO

Projection-based Constrained Policy Optimization

PDO

Primal-Dual Optimization

PILCO

Probabilistic Inference for Learning Control

PMU

Phasor Measurement Unit

PPO

Proximal Policy Optimization

p.u.

per unit

RES

Renewable Energy Source

RCPO

Reward Constrained Policy Optimization

SAC

Soft Actor-Critic

SafePO

Safe Policy Optimization

(SC)(O)PF

(Security Constrained) (Optimal) Power Flow

SG

Synchronous Generator

SoC

State of Charge

TD3

Twin Delayed Deep Deterministic Policy Gradient

TL

Thermal Load (such as room heater and water heater)

TR(PO/M)

Trust Region (Policy Optimization/Method)

V2G

Vehicle-to-Grid

V, F

Voltage, Frequency

I Introduction

With the extensive integration of RESs, ESSs, and advanced power electronic devices, modern power systems are facing increased uncertainty and complexity, which translate into a higher computational burden when modeling the stochastic, non-linear nature of the control and decision problems. However, thanks to the widespread deployment of smart sensors, such as PMUs, along with advanced communication technologies, a vast amount of power system data can be measured and utilized for state estimation and control. As a result, data-driven approaches like RL have emerged as key candidates for the numerical optimization of power system decision and/or control policies [1], which would otherwise be intractable to derive. Conventionally, RL training is based on trial-and-error and reward feedback interaction with a model-based simulated environment [2] or a data-driven model-free simulated environment [3]. Recently, DRL, which embeds NNs as the policy function, has proven expressive enough to solve complicated control tasks. Additionally, the NN approach is used to reduce computation costs for online implementation: once the NNs are trained, they approximate closed-form solutions and produce results quickly. However, nothing prevents the exploration of unsafe ranges during training or the execution of unsafe actions when the trained policies are deployed in real power systems. Therefore, the practical application of RL policies cannot be based on vanilla RL training [4].

In 2015, safe RL was first defined as “the process of learning policies that maximize the expectation of the reward in problems, where it is crucial to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes” [5]. Concurrently, the safe RL literature has received increasing attention. The methods can be coarsely divided into two categories: in the first, a safety factor that penalizes safety violations is added to the reward function; in the second, the exploration process in the training phase is modified to incorporate mechanisms that yield safe policies [5]. Based on these two approaches, numerous safe RL methods have been proposed, and many have been applied and tailored to power system decision and control problems, such as energy management, optimal power generation dispatch, EV charging, voltage control, and others that this paper covers in Section IV.

Reference [6] is currently the only paper that provides an overview of safe RL applications. However, the field is evolving fast, and we aim to provide first a comprehensive review of safe RL techniques in general and then a deep dive into their applications in power systems. The main contributions of the paper are as follows:

  1. This paper provides a comprehensive review of safe RL, covering its fundamental concepts, constraint classifications, existing algorithms, and benchmarks. It details the unique features and limitations of each RL algorithm, providing a foundation for future research endeavors in the domain of safe RL.

  2. It then comprehensively reviews the application of safe RL in power systems, covering almost all existing papers in this area. It categorizes these papers based on their application domains, listing each paper’s objectives, constraints, implemented safe RL techniques, environment types, and key features.

  3. We explore the key challenges and future research opportunities in safe RL for applications within power systems.

The framework of this paper is shown in Fig. 1. The rest of the paper is organized as follows. Section II introduces the CMDP and constraints. Section III provides a detailed introduction and classification of safe RL. Section IV offers a comprehensive review and comparative analysis of safe RL applications in different fields within the power system. Challenges and outlook are discussed in Section V and finally, Section VI concludes the paper.

Figure 1: The framework of safe RL in power system application.

II Constrained Markov Decision Process

II-A Problem formulation

An MDP is defined by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},R,\rho_{0},\gamma)$ whose elements are, respectively, the state space, action space, transition probability function, reward function, initial state distribution, and discount factor. When the decision problem fits an MDP, the objective is to determine the policy $\pi$ that maximizes the expected discounted reward $\mathcal{J}_{R}^{\pi_{\theta}}$, i.e., [4, 7, 8]:

𝒥Rπθ=𝔼τπ[t=0γtR(𝒔t,𝒂t,𝒔t+1)]superscriptsubscript𝒥𝑅subscript𝜋𝜃subscript𝔼similar-to𝜏𝜋delimited-[]superscriptsubscript𝑡0superscript𝛾𝑡𝑅subscript𝒔𝑡subscript𝒂𝑡subscript𝒔𝑡1\mathcal{J}_{R}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{% \infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]caligraphic_J start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = blackboard_E start_POSTSUBSCRIPT italic_τ ∼ italic_π end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_R ( bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) ] (1)

where $\tau\sim\pi$ indicates that the distribution over trajectories depends on the policy $\pi$: $\bm{s}_{0}\sim\rho_{0}$, $\bm{a}_{t}\sim\pi(\cdot|\bm{s}_{t})$, and $\bm{s}_{t+1}\sim\mathcal{P}(\cdot|\bm{s}_{t},\bm{a}_{t})$. Even if the transition probabilities and reward function are fully known, this task is often intractable. The usual approach is therefore to learn the policy under some parametrization.
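As an illustration of how the objective in (1) is typically estimated in practice, the following minimal Python sketch averages discounted returns over sampled trajectories. It is only a sketch under stated assumptions: the function names `discounted_return` and `estimate_objective` are hypothetical, the trajectories are finite (so the infinite sum is truncated), and the reward sequences are made-up numbers rather than data from any power system.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted sum of one trajectory's rewards: sum_t gamma^t * r_t."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(trajectories: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of J_R: mean discounted return over sampled trajectories."""
    returns = [discounted_return(traj, gamma) for traj in trajectories]
    return sum(returns) / len(returns)


# Two hypothetical reward sequences collected by rolling out some policy.
sampled_rewards = [[1.0, 1.0, 1.0], [0.0, 2.0]]
j_hat = estimate_objective(sampled_rewards, gamma=0.5)
```

With these inputs the two discounted returns are 1.75 and 1.0, so the estimate is their average.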

The CMDP $\mathcal{M}_{C}=(\mathcal{S},\mathcal{A}_{t},\mathcal{P},R,\rho_{0},\gamma,\mathcal{C})$ is an extension of the standard MDP that addresses a frequent model variation: the case in which the action space $\mathcal{A}_{t}$ depends on the current state, i.e., $\bm{s}_{t}\mapsto\mathcal{A}_{t}$, either because changes in the environment affect which actions are safe or feasible, or because the action incurs a state-dependent cost that the formulation requires to stay below a threshold. This occurs in physical systems in which the boundary conditions, the state, and the laws of physics limit what is feasible, which operations are unsafe, and how expensive a given agent action is. In a nutshell, what differentiates the various instances of CMDP from a conventional MDP is the class of constraints that characterize the action space as a function of the system dynamics, together with the specific engineering problem and context that define those constraints. In this review, we define the CMDP for power system problems as:

$$\begin{aligned}\max_{\pi_{\theta}\in\Pi_{S}}~&\mathcal{J}_{R}^{\pi_{\theta}}\\ \text{s.t.}~&\bm{a}_{t}\text{ is feasible}\end{aligned} \tag{2}$$

where “$\bm{a}_{t}$ is feasible” means not only that $\bm{a}_{t}$ is constrained within its upper and lower limits, but also that the resulting $\bm{s}_{t}$ falls within specified feasible sets. In power systems, constraints on the upper and lower bounds of $\bm{a}_{t}$ relate to the control ranges of various controllable devices, such as the power output of SGs, RESs, and ESSs, as well as the temperature setpoint of HVAC systems; these can typically be enforced by simply restricting the action space of RL. The requirement that $\bm{s}_{t}$ falls within specified feasible sets means that the state adheres to safe and stable operation constraints, such as boundary constraints on voltages, line flows, and building temperatures, as well as stability constraints on voltages, frequency, and rotor angles. Due to the highly non-linear and non-convex nature of power systems, obtaining a feasible $\bm{a}_{t}$ that guarantees a feasible $\bm{s}_{t}$ is challenging. This is also the main challenge of training safe RL.
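The two layers of feasibility described above can be sketched in a few lines of Python. This is a hedged illustration, not a method from the reviewed papers: the helper names, the 0.95 to 1.05 p.u. voltage band, and the linear `toy_voltage_model` are all assumptions chosen only to show that an action inside its box bounds can still drive the state outside its feasible set.

```python
from typing import Callable, List, Sequence


def clip_action(action: Sequence[float], lower: Sequence[float],
                upper: Sequence[float]) -> List[float]:
    """Enforce device limits by restricting the action to its box bounds."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]


def state_is_feasible(voltages_pu: Sequence[float],
                      v_min: float = 0.95, v_max: float = 1.05) -> bool:
    """Check an assumed voltage band (in p.u.) on every bus."""
    return all(v_min <= v <= v_max for v in voltages_pu)


def action_is_feasible(action: Sequence[float], lower: Sequence[float],
                       upper: Sequence[float],
                       predict_voltages: Callable[[Sequence[float]], List[float]]) -> bool:
    """An action is feasible only if it is in bounds AND the resulting state is safe."""
    in_bounds = all(lo <= a <= hi for a, lo, hi in zip(action, lower, upper))
    return in_bounds and state_is_feasible(predict_voltages(action))


def toy_voltage_model(action: Sequence[float]) -> List[float]:
    """Hypothetical linear stand-in for the (non-linear) power flow equations."""
    return [1.0 + 0.1 * a for a in action]


lower, upper = [0.0, 0.0], [1.0, 1.0]
clipped = clip_action([1.2, -0.3], lower, upper)
ok = action_is_feasible([0.4, 0.2], lower, upper, toy_voltage_model)
# In bounds, yet the first bus voltage reaches 1.08 p.u. and violates the band.
not_ok = action_is_feasible([0.8, 0.2], lower, upper, toy_voltage_model)
```

In a real system the map from action to state is the non-linear power flow, which is why the second check cannot be reduced to simple clipping.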

II-B Constraints

II-B1 Instantaneous Constraints

Instantaneous constraints are prevalent in power systems. For instance, in the optimal power generation dispatch of power systems, we encounter constraints such as power flow, dynamic limitations associated with BESSs, voltage magnitude bounds, and power generation limits, as detailed in Section IV-A. Another instance is voltage control, which incorporates additional voltage droop control dynamics and stability constraints, described in Section IV-B. We also explore other examples such as stability control, EV charging control, and building energy management in Section IV. In general, these constraints can be expressed as follows:

$$\begin{aligned}\max_{\pi_{\theta}\in\Pi_{S}}~&\mathcal{J}_{R}^{\pi_{\theta}}\\ \text{s.t.}~&g_{j}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})=0,\quad j=1,\cdots,m\\ &h_{k}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\leq 0,\quad k=1,\cdots,n\end{aligned} \tag{3}$$

where the control action must fulfill both the $m$ equality and $n$ inequality constraints. We incorporate the terms $\bm{s}_{t}$ and $\bm{s}_{t+1}$ within these constraints to represent the time-varying bounds of $\bm{a}_{t}$. Additionally, the dynamical constraints are also integrated into the aforementioned constraints.
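To make the structure of (3) concrete, the sketch below evaluates a set of equality and inequality constraints on a single transition. It is an illustrative assumption rather than an implementation from the literature: the function `instantaneous_violations`, the tolerance, and the toy battery model (efficiency 0.9, unit time step, SoC limited to [0, 1]) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A constraint maps (s_t, a_t, s_{t+1}) to a scalar residual.
Constraint = Callable[[Sequence[float], Sequence[float], Sequence[float]], float]


def instantaneous_violations(
    s: Sequence[float], a: Sequence[float], s_next: Sequence[float],
    equalities: Sequence[Constraint], inequalities: Sequence[Constraint],
    tol: float = 1e-6,
) -> Tuple[bool, List[float], List[float]]:
    """Return (satisfied, |g_j| residuals, max(0, h_k) violations) for one step."""
    eq_viol = [abs(g(s, a, s_next)) for g in equalities]
    ineq_viol = [max(0.0, h(s, a, s_next)) for h in inequalities]
    satisfied = all(v <= tol for v in eq_viol + ineq_viol)
    return satisfied, eq_viol, ineq_viol


ETA, DT = 0.9, 1.0  # assumed charging efficiency and time step

# Equality: battery dynamics soc' = soc + eta * p * dt (state = [soc], action = [p]).
soc_dynamics = lambda s, a, s_next: s_next[0] - s[0] - ETA * a[0] * DT
# Inequalities written as h <= 0: soc' <= 1 and soc' >= 0.
soc_upper = lambda s, a, s_next: s_next[0] - 1.0
soc_lower = lambda s, a, s_next: -s_next[0]

safe, _, _ = instantaneous_violations(
    [0.5], [0.2], [0.68], [soc_dynamics], [soc_upper, soc_lower])
unsafe, eq_v, ineq_v = instantaneous_violations(
    [0.5], [0.2], [1.1], [soc_dynamics], [soc_upper, soc_lower])
```

Because the check is applied at every time step, a single violating transition is enough to flag the policy as unsafe, in contrast to the cumulative formulations discussed next.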

II-B2 Cumulative Constraints

Cumulative constraints mandate that the sum or average of a specific cost signal remains within prescribed limits, calculated from the beginning of an event to the present time. Examples include total revenue and network throughput. These constraints are commonly applied in robot locomotion and manipulation, as discussed in [9]. Although several studies have attempted to adapt these constraints to power systems as a more flexible alternative to hard constraints, the application remains limited. For instance, [10] employs the discounted cumulative formulation in (4) to establish safety constraints in the management of distribution networks. In particular, they relax instantaneous constraints, such as voltage bounds, SoC bounds, and power quality, to a discounted cumulative formulation. Similarly, [11, 12] also utilize this approach. However, such constraints may not capture all safety requirements; they offer only a partial improvement in safety over using no constraints at all. The reason these studies do not consider instantaneous constraints is that cumulative relaxation offers a straightforward way to adapt constrained RL techniques, originally developed for robot locomotion and manipulation, to power systems. This approach not only simplifies implementation but also provides methodological insights that could potentially be extended to handle instantaneous constraints in future research.

To make the review more self-contained, we will review three kinds of cumulative constraints. In [13], the constraints for safe RL are divided into cumulative constraints and instantaneous constraints. For cumulative constraints, they are further categorized as discounted cumulative constraints (4), mean valued constraints (5), and probabilistic constraints (6). The discounted cumulative constraint is of the form:

𝒥hiπθ=𝔼τπ[t=0γthi(𝒔t,𝒂t,𝒔t+1)]εisuperscriptsubscript𝒥subscript𝑖subscript𝜋𝜃subscript𝔼similar-to𝜏𝜋delimited-[]superscriptsubscript𝑡0superscript𝛾𝑡subscript𝑖subscript𝒔𝑡subscript𝒂𝑡subscript𝒔𝑡1subscript𝜀𝑖\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{% \infty}\gamma^{t}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right]\leq% \varepsilon_{i}caligraphic_J start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = blackboard_E start_POSTSUBSCRIPT italic_τ ∼ italic_π end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) ] ≤ italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (4)

where $\varepsilon_{i}$ is the limit for each cumulative constraint.

The mean valued constraint is of the form:

𝒥hiπθ=𝔼τπ[1ttott=0ttot1hi(𝒔t,𝒂t,𝒔t+1)]εisuperscriptsubscript𝒥subscript𝑖subscript𝜋𝜃subscript𝔼similar-to𝜏𝜋delimited-[]1subscript𝑡totsuperscriptsubscript𝑡0subscript𝑡tot1subscript𝑖subscript𝒔𝑡subscript𝒂𝑡subscript𝒔𝑡1subscript𝜀𝑖\mathcal{J}_{h_{i}}^{\pi_{\theta}}=\mathbb{E}_{\tau\sim\pi}\left[\frac{1}{t_{% \text{tot}}}\sum_{t=0}^{t_{\text{tot}}-1}h_{i}(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t% +1})\right]\leq\varepsilon_{i}caligraphic_J start_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUPERSCRIPT = blackboard_E start_POSTSUBSCRIPT italic_τ ∼ italic_π end_POSTSUBSCRIPT [ divide start_ARG 1 end_ARG start_ARG italic_t start_POSTSUBSCRIPT tot end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t start_POSTSUBSCRIPT tot end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) ] ≤ italic_ε start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (5)

where $t_{\text{tot}}$ is the total number of time steps in each trajectory.
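As a minimal sketch of how the quantities inside (4) and (5) could be evaluated from logged data, the Python functions below compute the discounted and the mean cost of a single trajectory and compare a sample average over trajectories with the limit. The function names and the use of a plain sample mean as an estimate of the expectation are assumptions made for illustration, not part of the reviewed methods.

```python
from typing import Sequence


def discounted_cost(costs: Sequence[float], gamma: float) -> float:
    """Discounted cumulative cost sum_t gamma^t * h_t of one trajectory, cf. (4)."""
    total = 0.0
    for t, h in enumerate(costs):
        total += (gamma ** t) * h
    return total


def mean_cost(costs: Sequence[float]) -> float:
    """Mean cost (1 / t_tot) * sum_t h_t of one trajectory, cf. (5)."""
    return sum(costs) / len(costs)


def satisfied_on_average(per_trajectory_values: Sequence[float], eps: float) -> bool:
    """Sample-mean estimate of the expectation, compared with the limit eps."""
    estimate = sum(per_trajectory_values) / len(per_trajectory_values)
    return estimate <= eps


# Hypothetical per-step costs h_i of one short trajectory.
trajectory = [1.0, 0.0, 2.0]
print(discounted_cost(trajectory, gamma=0.5))  # 1.0 + 0.0 + 0.25 * 2.0 = 1.5
print(mean_cost(trajectory))                   # (1.0 + 0.0 + 2.0) / 3 = 1.0
```

Because both forms constrain an average over trajectories, individual trajectories (or individual time steps) may still exceed the limit.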

The second group concerns the probability that the cumulative costs violate a constraint [13]. Probabilistic constraints are of the form:

$$\mathcal{J}_{h_i}^{\pi_\theta} = \mathbb{P}\left[\sum_{t} h_i(\bm{s}_t,\bm{a}_t,\bm{s}_{t+1}) \leq \varepsilon_i\right] \geq \zeta \tag{6}$$

where $\varepsilon_i$ is the cumulative cost threshold for each trajectory and $\zeta \in (0,1)$ is the probability limit.
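A probabilistic constraint of the form (6) can be checked empirically from sampled trajectories. The sketch below, which assumes that per-step costs are stored in a NumPy array with one row per trajectory (a hypothetical data layout), estimates the probability that the cumulative cost stays below the threshold and compares it with the probability limit.

```python
import numpy as np


def empirical_safety_probability(costs: np.ndarray, eps: float) -> float:
    """Fraction of trajectories whose cumulative cost is at most eps.

    costs has shape (n_trajectories, n_steps); this layout is an assumption.
    """
    cumulative = costs.sum(axis=1)
    return float(np.mean(cumulative <= eps))


def chance_constraint_holds(costs: np.ndarray, eps: float, zeta: float) -> bool:
    """Empirical version of P[sum_t h_i <= eps] >= zeta, cf. (6)."""
    return empirical_safety_probability(costs, eps) >= zeta


# Four hypothetical trajectories with two steps each.
costs = np.array([[0.1, 0.2], [0.5, 0.6], [0.0, 0.1], [0.3, 0.3]])
print(empirical_safety_probability(costs, eps=1.0))  # 3 of 4 trajectories are safe
```

With few samples the estimate is noisy, so a practical check would likely add a confidence margin to the probability limit.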

Here, it is important to emphasize again that in power systems, the majority of constraints must be satisfied at every instant, thus they are commonly implemented as instantaneous constraints. For example, [14] utilizes the expected discounted reward, whereas constraints related to branch power flow and security operations are treated as instantaneous constraints.

II-C Constraints in Power Systems: Overview

In power system applications, the classification of constraints into instantaneous and cumulative constraints is related to the required degree of constraint satisfaction and the safe RL algorithms used. Typically, bus balance equations, upper and lower power limits of various equipment, ESS capacity constraints, certain voltage amplitude constraints, and some stability constraints are considered hard constraints. Safe RL algorithms capable of ensuring the satisfaction of hard constraints include the projection method (III-B), the Lyapunov method (III-C), the shielding method (III-E), the safety layer method (III-F), and the barrier function method (III-G). For example, [15] uses a logarithmic barrier function that drives $\mathcal{J}_{h_i}^{\pi_\theta}$ toward infinity when the voltage exceeds its bounds, thereby satisfying hard voltage constraints. Due to discrepancies between models and real systems, various uncertainties of RESs and loads, and algorithmic shortcomings, constraints that are satisfied in theory may still not be guaranteed in actual deployment. Therefore, GP methods (III-D) and RRL (III-H) have been proposed, using the probabilistic/chance constraint (6). However, their application in power systems remains underexplored. A more common approach is to use constrained game-theoretic RL within RRL [14, 16]. Furthermore, by design, some safe RL algorithms can only encourage constraint satisfaction while maximizing rewards. Such algorithms include Lagrangian relaxation (III-A) and penalty functions. For example, [17] uses the voltage constraint metric $\mathcal{J}_{h_i}^{\pi_\theta} = \sum_{i\in\mathcal{N}} \max\left\{ |\bm{v}_{i,t} - 1| - 0.05,\, 0 \right\}$ and employs Lagrangian relaxation for voltage control, which cannot guarantee absolute adherence to voltage constraints, so the voltage limit is classified as a soft constraint. For some constraints, instead, such as user satisfaction with EV charging and voltage control at certain nodes, the goal is to approach standard values as closely as possible, making them inherently soft constraints. The different constraint types of safe RL are illustrated in Fig. 2.

Figure 2: Illustrations of different constraints of safe RL. (a): Cumulative constraints (4)-(5). (b): Probabilistic constraints (6). (c): Instantaneous constraints and hard constraints. (d): Soft constraint, where the final $\pi_\theta$ may be either safe or unsafe.
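To make the soft voltage constraint metric of [17] concrete, the following sketch evaluates the total voltage-band violation for a vector of per-unit bus voltages. The function name and the example voltages are hypothetical; the nominal value of 1 p.u. and the 0.05 p.u. band follow the metric quoted above.

```python
import numpy as np


def voltage_violation(v_pu: np.ndarray, nominal: float = 1.0, band: float = 0.05) -> float:
    """Sum over buses of max(|v - nominal| - band, 0), in per unit."""
    excess = np.abs(v_pu - nominal) - band
    return float(np.sum(np.maximum(excess, 0.0)))


# Hypothetical bus voltages: two inside the band, one above, one below.
voltages = np.array([1.00, 1.04, 1.07, 0.93])
print(voltage_violation(voltages))  # approximately 0.04 (0.02 from each violating bus)
```

The metric is zero whenever all voltages lie inside the band, so a Lagrangian penalty built on it only becomes active once a limit is exceeded.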

III Safe Reinforcement Learning

Safe RL is often formulated as a CMDP problem, where the objective is to maximize the reward of agents while ensuring that the agents satisfy safety constraints [18, 4]. Safe RL is categorized into different types from various perspectives. This section primarily categorizes these types based on the techniques used to ensure constraint satisfaction and provides detailed introductions of the techniques and benchmarks.

III-A Lagrangian Relaxation / Primal-Dual Method

Lagrangian relaxation, also known as the primal-dual method, is the most common technique in safe RL. The key idea of this method is to transform the CMDP problem into an unconstrained dual problem. This is achieved by employing adaptive Lagrange multipliers to penalize constraints [19]:

\begin{align}
\textbf{Instantaneous}: \quad & \min_{\lambda_i \geq 0} \max_{\theta} \mathcal{L}(\lambda_i, \theta) = \min_{\lambda_i \geq 0} \max_{\theta} \left[ J_R^{\pi_\theta} - \sum_i \lambda_i \cdot h_i \right] \tag{7a} \\
\textbf{Cumulative}: \quad & \min_{\lambda_i \geq 0} \max_{\theta} \mathcal{L}(\lambda_i, \theta) = \min_{\lambda_i \geq 0} \max_{\theta} \left[ J_R^{\pi_\theta} - \sum_i \lambda_i \cdot \left( J_{h_i}^{\pi_\theta} - \varepsilon_i \right) \right] \tag{7b}
\end{align}

The solution of (7) relies on Danskin’s theorem and convex analysis [20]. Due to its straightforward implementation and compatibility with both on-policy and off-policy methods, Lagrangian relaxation has been integrated with other RL algorithms, fostering the creation of numerous variants, such as DDPG-Lag, PPO-Lag, TRPO-Lag, TD3-Lag, SAC-Lag, MAPPO, RCPO, PDO, TRPO-PID, CPPO-PID, DDPG-PID, TD3-PID, SAC-PID [21, 22, 19, 23].
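The alternating updates behind (7b) can be illustrated on a toy problem. The sketch below is not an RL algorithm: it assumes a scalar policy parameter with a hypothetical reward $J_R(\theta) = -(\theta-2)^2$ and cost $J_h(\theta) = \theta$, and it only shows the gradient ascent step on $\theta$ and the projected ascent step on the multiplier $\lambda$.

```python
def primal_dual_toy(eps: float = 1.0, lr_theta: float = 0.05,
                    lr_lambda: float = 0.05, steps: int = 2000):
    """Gradient ascent on theta and projected ascent on lambda for
    L(lambda, theta) = J_R(theta) - lambda * (J_h(theta) - eps),
    with the assumed toy functions J_R = -(theta - 2)^2 and J_h = theta."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: dL/dtheta = -2 * (theta - 2) - lambda
        theta += lr_theta * (-2.0 * (theta - 2.0) - lam)
        # Dual step: lambda grows while J_h > eps and is projected onto lambda >= 0
        lam = max(0.0, lam + lr_lambda * (theta - eps))
    return theta, lam


theta, lam = primal_dual_toy()
print(theta, lam)  # approaches the constrained optimum theta = 1 with lambda = 2
```

In the Lagrangian-based algorithms listed above, the exact gradients used here are replaced by policy-gradient and cost estimates obtained from sampled trajectories, which is why constraint violations can still occur during training.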

The Lagrangian relaxation method is the most commonly used approach in power systems, capable of being easily integrated with various algorithms for application across a wide range of domains. Based on instantaneous or hard constraints, [24] utilizes a primal-dual approach to optimize the control of power generation and BESS charging and discharging actions in a multi-stage real-time stochastic dynamic OPF. Additionally, [25] applies constrained SAC to the Volt-VAR control problem by synergistically combining the merits of the maximum-entropy framework, the method of multipliers, a device-decoupled neural network structure, and an ordinal encoding scheme. Furthermore, [26] employs constrained RL for the predictive control of OPF, paired with EV charging control. On the other hand, based on cumulative or soft constraints, [27] approximates the actor gradients by solving the Karush-Kuhn-Tucker conditions of the Lagrangian, instead of constructing reward critic networks and cost critic networks through interactions with the environment. Then, the interior point method is incorporated to derive the parameter updating rule for the DRL agent. Similarly, [28] develops a soft-constraint enforcement method to adaptively encourage the control policy in the safety direction with nonconservative control actions and find decisions with near-zero degrees of constraint violations.

III-B Projection Method / Trust Region Method

The TRM ensures constraint satisfaction at every step and enhances performance by updating the trust region policy gradient and projecting the policy into a safe feasible set during each iteration [29]. Typical projection methods include CPO [9], PCPO [30], FOCOPS [31], CUP [32], and MACPO [22]. Among these, PCPO is implemented through a two-step process: first, it conducts a local reward update, and then it projects the policy back onto the constraint set to address any constraint violations, as depicted in Fig. 3.

Figure 3: Update procedures for PCPO. In step one (red arrow), PCPO follows the reward improvement direction in the trust region (light green). In step two (blue arrow), PCPO projects the policy onto the constraint set (light orange).
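The two-step structure of PCPO can be sketched in a strongly simplified setting. The code below assumes a parameter vector, a single linear constraint $a^{T}\theta \leq b$, and a Euclidean projection; the actual algorithm works with a trust region and a linearized cost constraint on the policy, so this is an illustration of the idea rather than an implementation of PCPO.

```python
import numpy as np


def reward_step(theta: np.ndarray, reward_grad: np.ndarray, step_size: float) -> np.ndarray:
    """Step one: move along the reward improvement direction."""
    return theta + step_size * reward_grad


def project_onto_halfspace(theta: np.ndarray, a: np.ndarray, b: float) -> np.ndarray:
    """Step two: Euclidean projection onto the set {theta : a @ theta <= b}."""
    violation = float(a @ theta - b)
    if violation <= 0.0:
        return theta  # already feasible, nothing to correct
    return theta - (violation / float(a @ a)) * a


theta = np.array([0.0, 0.0])
theta = reward_step(theta, reward_grad=np.array([1.0, 1.0]), step_size=1.0)
theta = project_onto_halfspace(theta, a=np.array([1.0, 0.0]), b=0.5)
print(theta)  # [0.5 1. ]
```

The projection changes only the component of the update that violates the constraint, which mirrors the red and blue arrows in Fig. 3.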

In the power system domain, TRMs have also seen widespread application. For instance, [33] introduced a projection-embedded MA-DRL algorithm that smoothly and effectively restricts the DRL agent action space to prevent any violations of physical constraints, thereby achieving decentralized optimal control of distribution grids with a guaranteed 100% safety rate. Additionally, in the area of EV charging problems, [34] utilizes a penalty function to penalize the neural network output if it exceeds the action space and uses a projection operator to avoid incurring a negative reward when no EV is occupying the charging bay. In addition, [35] employs CPO for volt-VAR control to minimize the total operation costs while satisfying the physical operation constraints. However, TRMs, primarily based on TRPO or PPO, are not easily integrated with other RL types and are computationally intensive in high dimensions, limiting their suitability for large-scale safe RL problems [36].

III-C Lyapunov Method

Lyapunov functions, widely used in control engineering for controller design [37], were first applied to safe RL in [38]. The application of the Lyapunov method in power systems is limited because it requires prior knowledge of a Lyapunov function. If the model of environmental dynamics is unknown, identifying a suitable Lyapunov function can be challenging. For example, [39] integrates a Lyapunov function into the structural properties of primary frequency controllers, guaranteeing local asymptotic stability over a large set of states. Additionally, [40] utilizes Lyapunov theory to design the controller that satisfies specific Lipschitz constraints for decentralized inverter-based voltage control. In addition, [41] utilizes a stability-constrained RL method for real-time voltage control in distribution grids, providing a formal voltage stability guarantee using the Lyapunov function.
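To illustrate what prior knowledge of a Lyapunov function means in the simplest case, the sketch below assumes known linear dynamics $x_{t+1} = A x_t$ and a quadratic candidate $V(x) = x^{T} P x$, and checks the decrease condition $A^{T} P A - P \prec 0$. The matrices are hypothetical, and real power system dynamics are nonlinear, so this is only a didactic example.

```python
import numpy as np


def lyapunov_decrease_holds(A: np.ndarray, P: np.ndarray) -> bool:
    """True if P is positive definite and V(x) = x^T P x strictly decreases
    along x_{t+1} = A x_t for every nonzero x."""
    P_sym = 0.5 * (P + P.T)
    if np.any(np.linalg.eigvalsh(P_sym) <= 0.0):
        return False  # V is not a valid (positive definite) candidate
    M = A.T @ P_sym @ A - P_sym
    return bool(np.all(np.linalg.eigvalsh(M) < 0.0))


A_stable = np.array([[0.9, 0.1], [0.0, 0.8]])
A_unstable = np.array([[1.1, 0.0], [0.0, 0.5]])
print(lyapunov_decrease_holds(A_stable, np.eye(2)))    # True
print(lyapunov_decrease_holds(A_unstable, np.eye(2)))  # False
```

When the dynamics are unknown, neither the matrix A nor a suitable P is available, which is the difficulty noted above.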

III-D Gaussian Process Method

GP [42] is widely utilized in numerous approaches to estimate uncertainty and identify unsafe areas. Consequently, assessments based on GP can be incorporated into the learning process to enhance agent safety [43]. GP-based safe RL algorithms include SafeOpt [44] and PILCO [45]. The application of GP method-based safe RL in power systems is limited, meriting further research to adequately address the various uncertainties inherent in power systems. The potential disadvantage of GP methods is their computational complexity and scalability issues, especially as the dimensionality of the problem space increases [36].

III-E Shielding Method

In [46], the shield is introduced for the first time in RL. This shield is computed in advance, based on the safety component of the system specification provided and an abstraction of the dynamics of the agent’s environment. It guarantees safety with minimal interference, implying that the shield limits the agent’s actions as little as necessary, only prohibiting actions that could jeopardize the safe behavior of the system. The shielded RL is shown in Fig. 4.

Figure 4: Shielded RL. Machine learning is applied to control systems in such a way that the correctness of the system’s execution against a given specification is assured during both the learning and controller execution phases, regardless of the convergence speed of the learning process.

Shielding is a method that enforces constraint satisfaction, making it highly suitable for power system problems with hard constraints. For instance, in [47], actions that would lead to dangerous states, such as the SoC of BESSs being fully charged or depleted, are substituted by the shielding mechanism with safe actions to maintain system stability. Additionally, [48] combines a correction model adapted from gradient descent with the prediction model as a post-posed shielding mechanism to enforce safe actions in computer room air conditioning unit control problems. In addition, in unit commitment scheduling, [49] utilizes action space clipping to ensure that uncertainty estimates are reasonable and within appropriate bounds obtained from historical data. A potential drawback of the shielding method is the challenge of identifying feasible, safe actions based on infeasible ones, which requires underlying knowledge of the system. This can be difficult for certain complex systems or specific control scenarios [36].
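A shield for the BESS example can be sketched as a function that passes safe actions through unchanged and replaces unsafe ones with the nearest safe action. The lossless SoC model, the sign convention (positive power means charging), and the SoC limits below are assumptions made for illustration and are not taken from [47].

```python
def shield_bess_action(soc: float, power_kw: float, dt_h: float, capacity_kwh: float,
                       soc_min: float = 0.1, soc_max: float = 0.9) -> float:
    """Return a charging power (kW, positive = charging) that keeps the next
    SoC inside [soc_min, soc_max] under the assumed lossless model
    soc_next = soc + power_kw * dt_h / capacity_kwh."""
    max_charge = (soc_max - soc) * capacity_kwh / dt_h      # largest admissible charging power
    max_discharge = (soc_min - soc) * capacity_kwh / dt_h   # most negative admissible power
    # Minimal interference: the proposed action is changed only if it is unsafe.
    return min(max(power_kw, max_discharge), max_charge)


print(shield_bess_action(soc=0.5, power_kw=10.0, dt_h=1.0, capacity_kwh=100.0))   # 10.0, unchanged
print(shield_bess_action(soc=0.85, power_kw=50.0, dt_h=1.0, capacity_kwh=100.0))  # about 5 kW
```

The shield needs the SoC dynamics to compute the admissible range, which reflects the requirement for underlying system knowledge mentioned above.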

III-F Safety Layer Method

Both the safety layer and shielding method integrate safety into the RL process, but they differ in their implementation: the safety layer acts as an additional check within the RL framework, whereas shielding employs an external system (the shield) that intervenes only when necessary to prevent unsafe actions. The safety layer method, first proposed in [50] for continuous action spaces in RL, emphasizes maintaining zero-constraint violations throughout the learning process. It expresses safety constraints as linear functions of the action through a first-order approximation. Assuming that at most one constraint is violated at any time, an analytical solution to the safety layer optimization problem can be directly obtained. The linearization equation and a visualization of the safety layer are shown in (8) and Fig. 5, respectively.

$$\overline{h}_i(s_{t+1}) \triangleq h_i(s_t, a_t) \approx \overline{h}_i(s_t) + g(s_t; w_i)^{T} a_t \tag{8}$$

where $w_i$ are the weights of the NN, and $g(s_t; w_i)$ denotes the first-order approximation of $h_i(s_t, a_t)$ with respect to $a_t$.

Figure 5: Safety layer. Each safety signal $h_i(s,a)$ is approximated with a linear model with respect to $a$, whose coefficients are features of $s$, extracted with a NN.
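Under the linearization (8) and the single-violated-constraint assumption, the safety layer reduces to projecting the proposed action onto one half-space, which has a closed form. The sketch below assumes that safety means $\overline{h}_i(s_t) + g_i^{T} a \leq 0$ (the limit of zero is an assumption) and that the vectors $g_i$ are already available as rows of a matrix, for example as the output of a trained NN.

```python
import numpy as np


def safety_layer(action: np.ndarray, h_bar: np.ndarray, G: np.ndarray,
                 limit: float = 0.0) -> np.ndarray:
    """Correct `action` so that the most violated linearized constraint
    h_bar[i] + G[i] @ a <= limit holds.

    action: (m,) proposed action; h_bar: (k,) current constraint values;
    G: (k, m) with row i equal to g(s_t; w_i), assumed nonzero.
    """
    # Multiplier each constraint would need if it were the only active one.
    lam = np.maximum((h_bar + G @ action - limit) / np.sum(G * G, axis=1), 0.0)
    i = int(np.argmax(lam))          # most violated constraint
    return action - lam[i] * G[i]    # unchanged when lam[i] == 0


proposed = np.array([1.0, 0.0])
safe = safety_layer(proposed, h_bar=np.array([0.5]), G=np.array([[1.0, 0.0]]))
print(safe)  # [-0.5  0. ], which puts the linearized constraint exactly at its limit
```

If several constraints are violated at once, correcting only the most violated one does not guarantee that the others are satisfied.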

The safety layer method has been widely applied in power systems. For example, in optimal power generation dispatch, [51] proposes a hybrid knowledge-data-driven safety layer to convert unsafe actions into the safety region, which is accelerated by a security-constrained linear projection model. Additionally, in volt-VAR control, [52] adds a safety layer to the policy neural network to enhance operational constraint satisfaction during both the initial exploration phase and the convergence phase. In addition, [53] uses action clipping, reward shaping, and expert demonstrations to ensure safe exploration and accelerate the training process during the online training stage for the assist service restoration problem. However, the linear approximation in the safety layer might not accurately capture the complexities of underlying dynamics in highly non-linear systems, and iterating at every time step could introduce a significant computational burden. Moreover, the assumption that at most one constraint is violated at a time may not be valid in complex environments where multiple safety constraints are concurrently active.

III-G Barrier Function Method

The barrier function method involves adding a barrier function penalty term to the original objective function. When the system state approaches the safety boundary, the value of the constructed barrier function tends to infinity, thereby ensuring that the state remains within the safe boundary [54]. The most typical barrier function method is IPO, which augments the objective with logarithmic barrier functions, drawing inspiration from the interior-point method [55]:

\begin{align}
\textbf{Instantaneous}: \quad & \max_{\theta} J_R^{\pi_\theta} + \sum_i \frac{1}{t_i} \log(-h_i) \tag{9a} \\
\textbf{Cumulative}: \quad & \max_{\theta} J_R^{\pi_\theta} + \sum_i \frac{1}{t_i} \log\left(-J_{h_i}^{\pi_\theta} + \varepsilon_i\right) \tag{9b}
\end{align}

where $t_i$ is a hyperparameter for $h_i$. The illustration of IPO is shown in Fig. 6.

Figure 6: Barrier function. The solid red line represents the logarithm barrier function $\log(-J_h^{\pi_\theta} + \varepsilon)/t$, which is a differentiable approximation of the indicator function $I(x)$.
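The behavior of the logarithmic barrier in (9b) can be checked numerically. In the sketch below, the function names are hypothetical and the constraint values are passed in directly; in IPO they would be estimated from sampled trajectories.

```python
import math
from typing import Sequence


def log_barrier(j_h: float, eps: float, t: float) -> float:
    """log(eps - j_h) / t for a feasible value, and -inf once j_h >= eps."""
    slack = eps - j_h
    if slack <= 0.0:
        return -math.inf
    return math.log(slack) / t


def barrier_objective(j_r: float, j_hs: Sequence[float],
                      eps_list: Sequence[float], t: float) -> float:
    """Reward objective augmented with one barrier term per constraint, cf. (9b)."""
    return j_r + sum(log_barrier(j, e, t) for j, e in zip(j_hs, eps_list))


for t in (1.0, 10.0, 100.0):
    # The penalty at an interior point shrinks toward zero as t grows.
    print(t, log_barrier(0.5, eps=1.0, t=t))
print(log_barrier(1.2, eps=1.0, t=10.0))  # -inf: the constraint is violated
```

A larger $t_i$ makes the barrier a closer approximation of the indicator function but also steeper near the boundary, which is one reason the hyperparameter requires tuning.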

The barrier function method and IPO have been widely applied in power systems to ensure constraint satisfaction. For example, [12] utilizes IPO to ensure the fulfillment of distribution network constraints without the need for designated penalty terms and the associated tuning of penalty factors, or repeatedly solving optimization problems for action rectification. Additionally, [56] uses IPO to facilitate desirable learning behavior towards constraint satisfaction and policy improvement simultaneously during online preventive control for transmission overload relief. In addition, [57] proposes a safe RL method for emergency load shedding in power systems, where the reward function includes a barrier function that approaches negative infinity as the system state approaches safety bounds. However, the accurate formulation and tuning of barrier functions necessitate knowledge of system dynamics, which can be challenging in complex environments.

III-H Robust Reinforcement Learning

One of the challenges in RL is generalization under uncertainties not seen during training. To address this, RRL frameworks have been developed, focusing on enhancing the reliability and robustness of RL agents for the worst-case scenarios [58, 59]. Two notable approaches in this context are chance-constrained RRL and constrained game-theoretic RL. It is important to note that RRL is not universally recognized as a safe RL algorithm in other fields. However, due to the significant uncertainties in power systems, RRL is employed to enhance control robustness and is reviewed here.

III-H1 Chance-constrained RRL

Chance-constrained RRL, in particular, focuses on ensuring that policies perform well under uncertain conditions by incorporating probabilistic constraints into the learning process [60]. In this framework, the goal is not just to maximize expected rewards but to do so while ensuring that the probability of undesirable outcomes (e.g., safety violations) remains below a specified threshold [61]. This is particularly important in scenarios where safety and reliability are critical, such as autonomous driving or robotics [62]. The general form can be expressed as:

$$
\begin{aligned}
\max_{\pi}\ & \mathcal{J}_R^{\pi_\theta} \\
\text{s.t.}\ & \mathbb{P}\left[\min_i h_i(\bm{s}_t,\bm{a}_t,\bm{s}_{t+1}) \leq \varepsilon_i\right] \geq \zeta,\ \forall t \in \mathcal{T}
\end{aligned}
\tag{10}
$$

III-H2 Constrained game-theoretic RL

Constrained game-theoretic RL is a framework that models the interaction between the RL agent and its environment as a game, specifically focusing on scenarios where there are constraints that the agent must respect during the learning and decision-making processes [63]. The objective is to maximize the agent’s rewards while minimizing the possible losses or costs, considering the worst-case scenarios posed by adversaries’ actions or environmental uncertainties [64]. A representative formulation using a minimax optimization framework is [63]:

$$
\begin{aligned}
\min_{\pi_\theta^{\text{adv}}} \max_{\pi_\theta}\ & \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t} R(s_t, a_t, a_t^{\text{adv}}, s_{t+1})\right] \\
\text{s.t.}\ & h_i(s_t, a_t, a_t^{\text{adv}}, s_{t+1}) \leq 0,\ \forall t \in \mathcal{T}
\end{aligned}
\tag{11}
$$
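The worst-case reasoning behind (11) can be illustrated with a single-step, finite-action simplification. The sketch below assumes hypothetical reward and cost tables indexed by the agent action (rows) and the adversary action (columns), and it lets the agent commit first while the adversary responds; this is a conservative reading of the game used for illustration, not the training procedure of [63].

```python
from typing import Optional

import numpy as np


def robust_action(reward: np.ndarray, cost: np.ndarray) -> Optional[int]:
    """Pick the agent action with the best worst-case reward among the actions
    whose cost is <= 0 for every adversary action.

    reward[i, j], cost[i, j]: agent action i, adversary action j.
    Returns None if no action is feasible against all adversary actions.
    """
    feasible = np.all(cost <= 0.0, axis=1)
    if not feasible.any():
        return None
    worst_case = np.where(feasible, reward.min(axis=1), -np.inf)
    return int(np.argmax(worst_case))


reward = np.array([[5.0, 1.0], [3.0, 2.0], [10.0, 9.0]])
cost = np.array([[-1.0, -1.0], [-1.0, -0.5], [-1.0, 0.2]])
print(robust_action(reward, cost))  # 1: action 2 pays more but is unsafe against adversary action 1
```

The example shows the typical trade-off of robust formulations: the selected action gives up nominal reward in exchange for feasibility under every modeled disturbance.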

One of the key benefits of constrained game-theoretic RL is its ability to handle competitive and cooperative interactions within complex environments, making it suitable for applications ranging from strategic games to cooperative multi-agent scenarios like mobile edge computing [65] and coordination in robotic teams [66].

RRL is applied in power systems to ensure that control strategies remain robust under various uncertainties. For example, [14] utilizes adversarial safe RL to address the model inaccuracy and uncertainty of virtual power plants without relying on an accurate environmental model. Additionally, in the sequential OPF problem, [51] employs a bi-level robust optimization approach to optimize the training loss of the Q network. In addition, in the inverter-based volt-VAR control problem, [16] develops a highly efficient adversarial RL algorithm to train an offline agent that is robust to model mismatches during the offline stage.

III-I Benchmarks

Benchmarks include both benchmark environments and benchmark algorithms. Safety Gym, developed by OpenAI, is the first widely recognized safe benchmark environment. It includes an environment-builder and a suite of pre-configured benchmark environments [21, 67]. Correspondingly, Safety Starter Agents, a benchmark algorithm library, has been developed based on Safety Gym [68]. The supported algorithms in this library include PPO, PPO-Lag, TRPO, TRPO-Lag, SAC, SAC-Lag, and CPO. This package has been tested on Mac OS Mojave and Ubuntu 16.04 LTS and is likely compatible with most recent Mac and Linux operating systems.

Safety Gymnasium, an update and extension of Safety Gym, has become the mainstream platform [69, 70]. Correspondingly, a benchmark repository for safe RL algorithms, named SafePO, has been proposed [71]. SafePO is tested on the Linux platform and potentially supports Mac or Windows, requiring only modifications to the Linux path and sort functions for compatibility.

SafePO further extends the variety of supported safe RL algorithms, as illustrated in Fig. 7.

Figure 7: Supported safe RL algorithms of SafePO.

OmniSafe emerges as the first unified learning framework in the field of safe RL, featuring a highly modular framework that includes a comprehensive collection of algorithms specifically developed for safe RL across various domains. Its versatility comes from an abstracted algorithm structure and a well-designed API, facilitating seamless integration of different components, thereby simplifying extension and customization for developers. Additionally, OmniSafe enhances algorithm learning speeds through process parallelism, supporting both environment-level and agent asynchronous parallel learning. OmniSafe is supported and tested on Linux and also supports M1 and M2 versions of macOS. However, it does not support Windows [72, 73]. The supported safe RL algorithms of OmniSafe are shown in Table I.

TABLE I: Supported Safe RL Algorithms of OmniSafe

| Domains | Types | Algorithms Registry |
|---|---|---|
| On Policy | Primal-Dual | TRPO-Lag; PPO-Lag; PDO; RCPO |
| | Convex Optimization | CPO; PCPO; FOCOPS; CUP |
| | Penalty Function | IPO; P3O |
| | Primal | OnCRPO |
| Off Policy | Primal-Dual | DDPG-Lag; TD3-Lag; SAC-Lag; DDPG-PID; TD3-PID; SAC-PID |
| Model-based | Online Plan | SafeLOOP; CCEPETS; RCEPETS |
| | Pessimistic Estimate | CAPPETS |
| Offline | Q-Learning-Based | BCQ-Lag; C-CRR |
| | DICE-Based | COptDICE |
| Other MDP | ET-MDP | PPO/TRPO-EarlyTerminated |
| | SauteRL | PPOSaute; TRPOSaute |
| | SimmerRL | PPOSimmer-PID; TRPOSimmer-PID |

Overall, Safety Gymnasium is the current mainstream benchmark environment, and OmniSafe has also integrated Safety Gymnasium to ensure overall code compatibility. It is important to remark that Safety Gymnasium was primarily developed for control in gaming, robotics, autonomous driving, etc., featuring a series of agents such as point, car, dog, and ant, among others. It offers several specific environments tailored for challenges such as safe navigation, safe velocity, and safe vision, but it is not directly applicable to power system problem formulations. Hence, there is a need to develop corresponding power system control environments based on the environment templates provided by Safety Gymnasium. In terms of benchmark algorithms, OmniSafe offers a more comprehensive set of algorithms but currently does not support Windows due to difficulties with Python library installations. In contrast, SafePO is more easily extended to Windows. Since most professional power system software is developed for Windows, with less support for Linux and macOS, this may limit the application of OmniSafe in model-based environments. However, if surrogate models are used to substitute for physical models in a model-free environment, OmniSafe can be utilized in Linux or macOS.

Figure 8: RL schemes for the safe control and decision-making in power systems.

IV Power System Applications of Safe RL

This review synthesizes a broad collection of studies and applications of safe RL in power systems, covering a wide array of domains: optimal power generation dispatch, voltage control, stability control, EV charging control, building energy management, electricity market, system restoration, and unit commitment and reserve scheduling. Safe RL algorithms used in various application domains are presented in Fig. 1. As depicted in Fig. 8, RL-based schemes collect power system measurements, including PMU and AMI readings, and integrate system model knowledge into their policy training. They take actions to control power system devices while ensuring that safety requirements such as feasibility, stability, and robustness are met. For each domain, we review the research problem or objective function, the constraints, the constraint type (cumulative/instantaneous and hard/soft), the applied safety constraint techniques, and the key features, in order to compare the studies that apply safe RL.

IV-A Optimal Power Generation Dispatch

TABLE II: Safe RL Applications in Optimal Power Generation Dispatch

| Research | Problem/Objective | Constraint | Constraint Type | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [27] | Minimize the total generation cost | Physical operation constraints | Cum/Soft | Primal-dual method (III-A) | Combines the primal-dual DDPG with the classic SCOPF model. The actor gradients are approximated by solving the Karush-Kuhn-Tucker conditions of the Lagrangian. |
| [24] | Minimize the fuel costs and power loss from BESSs | Physical constraints | Ins/Hard | Projection (III-B) and primal-dual method (III-A) | A primal-dual approach is introduced to learn optimal constrained DRL policies specifically for predictive control in real-time stochastic dynamic OPF. |
| [74] | Minimize the total system cost | Physical constraints | Cum/Hard | Safety layer (III-F) | Unsafe actions are projected into the safe action space while constrained zonotope set is used to improve efficiency. |
| [75] | Minimize the cost of thermal power MESS | Power grid and MESSs constraints | Ins/Hard | Proximal gradient projection (III-B) | MESSs are modeled as CMDP, and a framework is proposed based on a DRL algorithm that considered the discrete-continuous hybrid action space of the MESSs. |
| [15] | Minimize the total energy cost | Power system constraints | Cum/Hard | Lagrange relaxation (III-A) and logarithmic barrier (III-G) | Function approximation addresses large, continuous state spaces, while a diffusion strategy coordinates actions of DG units and ESSs. |
| [76] | Minimize the generator fuel cost | Power system constraints | Ins/Hard | Safety layer (III-F) | The proposed method uses physics-driven parameters for easy modification and less conservative, easily re-parameterizable actions. |
| [77] | Minimize the operating cost | Power system constraints | Ins/Hard | Safety layer (III-F) | To avoid line overload, a safety layer is added by introducing transmission constraints to avoid dangerous actions and tackle sequential security-constrained OPF problem. |
| [10] | Minimize the total operating cost | Physical constraints of system and devices | Cum/Hard | CPO (III-B) | To optimize both discrete and continuous actions, a stochastic policy based on a joint distribution of mixed random variables is designed and learned through a NN approximator. |
| [11] | Minimize the total cost of operation of microgrids | Global and local constraints | Cum/Soft | Lagrangian relaxation (III-A) and projection (III-B) | The training process employs the gradient information of operational constraints to ensure that the optimal control policy functions generate safe and feasible decisions. |
| [78] | Minimize the operational cost | Operation and power balance constraints | Cum/Hard | CPO (III-B) and invalid action masking (III-E) | Invalid action masking is applied to avoid invalid actions, accomplished by replacing the logits of the actions to be masked with a large negative number. |
| [79] | Minimize the total operational cost | AC-PF constraints | Cum/Hard | CPO (III-B) | Contrary to traditional DRL methods, the proposed method constrains exploration to only those policies that comply with AC-PF constraints. |
| [28] | Minimize the total operational cost | Gas system and power system constraints | Cum/Soft | Lagrangian relaxation (III-A) | The penalty is adaptively updated based on the extent of constraint violation, facilitating the prediction of near-optimal control actions that achieve near-zero degrees of violation. |
| [80] | Minimize the operating cost for the whole horizon | Operational constraints | Ins/Hard | MIP formulation | The action-value function, approximated through a DNN, is structured as a MIP formulation, enabling the inclusion of constraints within the action space. |
| [81] | Optimize the total generation cost | Operational and linguistic stipulation constraints | N.A./Soft | Primal-dual method (III-A) | For the first time, a GPT LLM is integrated into the OPF framework alongside linguistic rules. This novel approach models and quantifies natural language stipulations as objectives and constraints within a primal-dual DRL loop. |
| [82] | Minimize the total operation cost | Operational constraints | N.A./Soft | Lagrangian relaxation (III-A) | Instead of using the critic network, the deterministic gradient is derived analytically and solved by using interior point method. |
| [83] | Minimize the total energy cost | Satisfaction of the energy demand | Cum/Soft | Lagrangian relaxation (III-A) and RRL (III-H) | This approach efficiently uses short-horizon forecasts to prevent energy demand failures and reduce costs, surpassing the capabilities of standard safe RL methods. |
| [12] | Minimize the costs of DGs production and RES curtailment | Constraints of distribution network | Cum/Hard | IPO (III-G) | The generalization of IPO is improved by extracting spatial-temporal features from microgrid operation data, leveraging the advantages of edge-conditioned convolutional networks and long short-term memory networks. |
| [84] | Multi-energy management | Thermal energy balance | Cum/Hard | Shielding method (III-E) | Decoupling architecture of safety constraint formulations from the RL formulation. Hard-constraint satisfaction without the need to solve a mathematical program. |
| [85] | Minimize the cost of electricity net, DG and gas | Constraints of the power and gas networks | Ins/Hard | Safety layer (III-F) | By learning a dynamic security assessment rule, a physically-informed safety layer ensures adherence to physical constraints by solving an action correction formulation. |
| [14] | Minimize the overall operation cost | Branch power flow security constraint | Ins/Soft | Lagrangian relaxation (III-A) and RRL (III-H) | An adversarial safe RL approach is proposed to enhance action safety and robustness against deviations between training and testing environments. |
| [51] | Minimize the operation cost | Operational constraints | Ins/Hard | Safety layer (III-F), projection (III-B), and RRL (III-H) | A safety layer that blends knowledge and data-driven approaches is created. Also, security constraints and linear projection are combined to improve computational speed. |

  • Cum: Cumulative; Ins: Instantaneous; N.A.: Not applicable or not available.
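The invalid action masking listed for [78] in Table II can be illustrated in a few lines. The sketch below is a generic NumPy example and not the implementation of [78]; the function name `masked_action_probs` and the constant `-1e9` are assumptions chosen for demonstration.

```python
import numpy as np


def masked_action_probs(logits, valid_mask, neg=-1e9):
    """Softmax over logits after replacing invalid entries with a large negative number."""
    logits = np.asarray(logits, dtype=float)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    # Invalid actions receive a very negative logit, so their probability
    # underflows to zero after the softmax.
    masked = np.where(valid_mask, logits, neg)
    masked = masked - masked.max()  # shift for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()
```

For example, `masked_action_probs([1.0, 2.0, 3.0], [True, False, True])` assigns zero probability to the second action and renormalizes over the remaining two, assuming at least one action is valid.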

Optimal power generation dispatch considers various constraints, ranging from simplified versions to security constraints, and includes economic dispatch, DC-OPF, AC-OPF, and SCOPF. The operation of a power system must meet both security and economic requirements. Considering credible contingencies, AC-OPF has been widely used [79, 86]. Most existing methods for solving OPF rely on analytical methods; however, given the inherently large scale of these problems, real-time computation is very challenging. A further variation of OPF is the SCOPF, which requires significantly longer computation times due to the additional security constraints [27]. To accelerate the calculation of SCOPF, methods such as DC-PF approximation [87], convex power flow approximation [88], and convex security constraint approximation [89] have been proposed. However, the accuracy of these methods has been questioned, and they remain time-consuming for large-scale systems. To accelerate computation and achieve better solutions, RL methods have been widely applied. Since traditional RL struggles to handle safety constraints effectively, safe RL has been further applied to address these issues.

The details of the applications of safe RL in optimal power generation dispatch are shown in Table II. Based on Table II, we summarize the foundational framework for implementing safe RL in optimal power generation dispatch through a specific example with SGs, RESs, and BESSs that incorporates strict physics-based constraints such as AC- and DC-PF constraints. If the system includes additional power system devices, the presented equations can be readily extended to accommodate them. Note that the models presented below are examples for illustration, and there are other RL formulations and models for optimal power generation dispatch depending on the specific problem setting. This is also true for other application domains. The state, action, reward, and constraints of optimal power generation dispatch are as follows.

IV-A1 AC-PF

AC-PF constraints describe the basic physics of power systems and have been widely considered in optimal power generation dispatch, voltage control, unit commitment, etc.

State

The states include active and reactive loads and voltage:

$$\bm{s}^{\text{AC}}_{t}\triangleq\left(\bm{v}_{t},\bm{p}^{\text{Load}}_{t},\bm{q}^{\text{Load}}_{t}\right) \tag{12}$$
Action

The control actions encompass both active and reactive power generation of SGs, active power generation of RESs, alongside power charging or discharging of BESSs:

$$\bm{a}^{\text{AC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{q}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right) \tag{13}$$
Reward

The reward includes the SG generation cost, the RES curtailment cost, and the BESS cost:

$$
\begin{aligned}
\max_{\pi_{\theta}\in\Pi_{S}}\quad & \mathbb{E}_{\tau\sim\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})\right] && \text{(14a)}\\
R^{\text{AC}}(\bm{s},\bm{a}) &= -\left|\sum_{\forall i\in\mathcal{G}}\left(a^{\text{SG}}_{i}(p^{\text{SG}}_{i,t})^{2}+b^{\text{SG}}_{i}p^{\text{SG}}_{i,t}+c^{\text{SG}}_{i}\right)\right| \\
&\quad -\sum_{\forall i\in\mathcal{R}}c^{\text{RES}}_{i}\left|p^{\text{RES}}_{\text{MPPT},i,t}-p^{\text{RES}}_{i,t}\right| \\
&\quad -\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{dis},i}p^{\text{BESS}}_{\text{dis},i,t}+\sum_{\forall i\in\mathcal{B}}c^{\text{BESS}}_{\text{ch},i}p^{\text{BESS}}_{\text{ch},i,t} && \text{(14b)}\\
\bm{s}^{\text{AC}}_{t} &= f_{t}(\bm{s}^{\text{AC}}_{t-1},\bm{a}^{\text{AC}}_{t-1}),\quad \bm{a}^{\text{AC}}_{t}\sim\pi(\bm{a}^{\text{AC}}_{t}|\bm{s}^{\text{AC}}_{t-1}) && \text{(14c)}
\end{aligned}
$$
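To make the structure of (14b) concrete, the following sketch evaluates the one-step reward for given dispatch values. It is an illustrative NumPy transcription of the formula only; the function name `reward_ac` and the argument names are hypothetical, and all coefficients passed to it would have to come from the specific system under study.

```python
import numpy as np


def reward_ac(p_sg, p_res, p_res_mppt, p_dis, p_ch,
              a, b, c, c_res, c_dis, c_ch):
    """Evaluate the one-step reward (14b); every argument is a 1-D array-like."""
    p_sg, p_res, p_res_mppt, p_dis, p_ch, a, b, c, c_res, c_dis, c_ch = (
        np.asarray(x, dtype=float)
        for x in (p_sg, p_res, p_res_mppt, p_dis, p_ch, a, b, c, c_res, c_dis, c_ch)
    )
    gen_cost = np.abs(np.sum(a * p_sg**2 + b * p_sg + c))      # quadratic SG cost
    curtail_cost = np.sum(c_res * np.abs(p_res_mppt - p_res))  # RES curtailment cost
    dis_term = np.sum(c_dis * p_dis)                           # BESS discharging term
    ch_term = np.sum(c_ch * p_ch)                              # BESS charging term
    # Signs follow (14b): discharging is subtracted, charging is added.
    return float(-gen_cost - curtail_cost - dis_term + ch_term)
```

With two generators, one RES unit curtailed by 0.3 p.u., and one BESS discharging 0.3 p.u., the function returns the negative sum of the three cost terms, which is what the agent maximizes in (14a).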
Constraint

The control actions derived from DRL must adhere to hard physical constraints. AC-PF constraints include bus active and reactive power balance constraints, SG active and reactive power generation constraints, RES active power generation constraints, voltage constraints, and branch apparent power constraints:

$$
\begin{aligned}
&\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t}=\Re\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\} && \text{(15a)}\\
&\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}-\bm{q}^{\text{Load}}_{t}=\Im\{\mathbb{D}(\bm{v}_{t}\bm{v}_{t}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}})\} && \text{(15b)}\\
&\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}},\quad\underline{\bm{q}}^{\text{SG}}\leq\bm{q}^{\text{SG}}_{t}\leq\overline{\bm{q}}^{\text{SG}} && \text{(15c)}\\
&\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}},\quad\underline{\bm{v}}\leq|\bm{v}|\leq\overline{\bm{v}},\quad|s_{ij}|\leq\overline{s}_{ij} && \text{(15d)}
\end{aligned}
$$

where $\mathbf{M}^{\text{SG}}$ denotes the matrix in $\{0,1\}^{N\times G}$ that maps the generation vector $\bm{p}_{t}^{\text{SG}}\in\mathbb{R}^{|\mathcal{G}|}$ to $\mathbb{R}^{N}$:

$$
\begin{aligned}
&[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=0,\quad[\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=0,\quad\forall i\in\mathcal{N}\setminus\mathcal{G} && \text{(16a)}\\
&[\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}]_{i}=p^{\text{SG}}_{j},\quad[\mathbf{M}^{\text{SG}}\bm{q}_{t}^{\text{SG}}]_{i}=q^{\text{SG}}_{j},\quad\forall i\in\mathcal{G},\forall j\in[G] && \text{(16b)}
\end{aligned}
$$
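The power balance constraints (15a) and (15b) and the mapping matrix in (16) can be checked numerically as sketched below. This is a minimal NumPy illustration under stated assumptions: the helper names `incidence` and `ac_balance_mismatch` are hypothetical, and the code relies on the identity that the diagonal of $\bm{v}\bm{v}^{\mathcal{H}}\mathbf{Y}^{\mathcal{H}}$ equals the element-wise product of $\bm{v}$ with the conjugate of $\mathbf{Y}\bm{v}$.

```python
import numpy as np


def incidence(n_bus, device_buses):
    """Build the {0,1} matrix that places device j at bus device_buses[j], as in (16)."""
    m = np.zeros((n_bus, len(device_buses)))
    m[device_buses, np.arange(len(device_buses))] = 1.0
    return m


def ac_balance_mismatch(v, y_bus, p_inj, q_inj):
    """Return the complex mismatch between scheduled and network injections.

    v      : complex bus voltage vector
    y_bus  : complex bus admittance matrix Y
    p_inj  : left-hand side of (15a), net active injection per bus
    q_inj  : left-hand side of (15b), net reactive injection per bus
    """
    v = np.asarray(v, dtype=complex)
    s_net = v * np.conj(y_bus @ v)  # equals the diagonal of v v^H Y^H
    return (np.asarray(p_inj) + 1j * np.asarray(q_inj)) - s_net
```

A safety layer or projection step would typically drive this mismatch to zero, for instance by letting a power flow solver determine the voltages, and then check the inequality constraints (15c) and (15d) on the result.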

IV-A2 DC-PF

DC-PF constraints are a linear approximation of the AC-PF constraints and are commonly used in optimal power generation dispatch and electricity market problems.

State

Voltage magnitudes and reactive power are neglected in DC-PF, so the state comprises the bus voltage angles and the active loads:

$$\bm{s}^{\text{DC}}_{t}\triangleq\left(\bm{\vartheta}_{t},\bm{p}^{\text{Load}}_{t}\right) \tag{17}$$
Action

The action involves only the generation or consumption of active power:

$$\bm{a}^{\text{DC}}_{t}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{BESS}}_{\text{ch},t},\bm{p}^{\text{BESS}}_{\text{dis},t}\right) \tag{18}$$
Reward

The reward is similar to that of the AC-PF case (14).

Constraint

The DC-PF constraints are a simplification of the AC-PF constraints, retaining only the active power components and disregarding voltage issues [90].

$$
\begin{aligned}
&\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{dis},t}-\mathbf{M}^{\text{BESS}}\bm{p}^{\text{BESS}}_{\text{ch},t}+\mathbf{M}^{\text{SG}}\bm{p}_{t}^{\text{SG}}+\mathbf{M}^{\text{RES}}\bm{p}_{t}^{\text{RES}}-\bm{p}^{\text{Load}}_{t}=\mathbf{B}\bm{\vartheta}_{t} && \text{(19a)}\\
&\underline{\bm{p}}^{\text{SG}}\leq\bm{p}^{\text{SG}}_{t}\leq\overline{\bm{p}}^{\text{SG}},\quad\underline{\bm{p}}^{\text{RES}}\leq\bm{p}^{\text{RES}}_{t}\leq\overline{\bm{p}}^{\text{RES}} && \text{(19b)}\\
&|p_{ij}|\leq\overline{p}_{ij} && \text{(19c)}
\end{aligned}
$$
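Because (19a) is linear, feasibility of a candidate action with respect to (19c) can be checked by a single linear solve. The sketch below is one possible NumPy illustration; the function name `dc_power_flow`, the choice of a slack bus with zero angle, and the lossless-line model with reactances `x` are assumptions made for the example rather than details given in the text.

```python
import numpy as np


def dc_power_flow(lines, x, p_inj, slack=0):
    """Solve B * theta = p_inj with theta[slack] = 0 and return (theta, line flows).

    lines : list of (from_bus, to_bus) index pairs
    x     : line reactances in p.u., one per line
    p_inj : net active injection per bus, i.e. the left-hand side of (19a)
    """
    p_inj = np.asarray(p_inj, dtype=float)
    n = len(p_inj)
    b_mat = np.zeros((n, n))
    for (i, j), x_ij in zip(lines, x):
        b_mat[i, i] += 1.0 / x_ij
        b_mat[j, j] += 1.0 / x_ij
        b_mat[i, j] -= 1.0 / x_ij
        b_mat[j, i] -= 1.0 / x_ij
    # Remove the slack bus row and column to obtain a nonsingular system.
    keep = [k for k in range(n) if k != slack]
    theta = np.zeros(n)
    theta[keep] = np.linalg.solve(b_mat[np.ix_(keep, keep)], p_inj[keep])
    flows = np.array([(theta[i] - theta[j]) / x_ij for (i, j), x_ij in zip(lines, x)])
    return theta, flows
```

Given the flows, the line limit (19c) reduces to an element-wise comparison such as `np.all(np.abs(flows) <= p_max)`, where `p_max` is a hypothetical array of line ratings.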

IV-A3 BESS Constraints

The BESS constraints include charging and discharging constraints, and SoC constraints.

$$
\begin{aligned}
&0\leq\bm{p}^{\text{BESS}}_{\text{ch},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{ch}},\quad 0\leq\bm{p}^{\text{BESS}}_{\text{dis},t}\leq\overline{\bm{p}}^{\text{BESS}}_{\text{dis}} && \text{(20a)}\\
&\underline{\bm{SoC}}^{\text{BESS}}\leq\bm{SoC}^{\text{BESS}}_{t}\leq\overline{\bm{SoC}}^{\text{BESS}} && \text{(20b)}\\
&\bm{SoC}^{\text{BESS}}_{t}=\bm{SoC}^{\text{BESS}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{BESS}}}\Big(\eta^{\text{BESS}}_{\text{ch}}\bm{p}^{\text{BESS}}_{\text{ch},t}-\frac{\bm{p}^{\text{BESS}}_{\text{dis},t}}{\eta^{\text{BESS}}_{\text{dis}}}\Big) && \text{(20c)}
\end{aligned}
$$
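One simple way to enforce (20a) to (20c) on a raw policy output is to clip the charging and discharging powers before applying the SoC update. The sketch below shows this for a single battery; it is an illustrative clipping rule and not a method taken from any of the reviewed papers. The function name `safe_bess_step` and all parameter names are hypothetical, and the current SoC is assumed to lie within its bounds already.

```python
import numpy as np


def safe_bess_step(soc, p_ch, p_dis, *, dt, e_cap, eta_ch, eta_dis,
                   p_ch_max, p_dis_max, soc_min, soc_max):
    """Clip a proposed (p_ch, p_dis) so that (20a)-(20c) hold, then update the SoC."""
    # Power limits (20a).
    p_ch = float(np.clip(p_ch, 0.0, p_ch_max))
    p_dis = float(np.clip(p_dis, 0.0, p_dis_max))
    # Largest powers that keep the next SoC inside (20b), derived from (20c).
    p_ch = min(p_ch, max(0.0, (soc_max - soc) * e_cap / (dt * eta_ch)))
    p_dis = min(p_dis, max(0.0, (soc - soc_min) * e_cap * eta_dis / dt))
    # SoC dynamics (20c).
    soc_next = soc + dt / e_cap * (eta_ch * p_ch - p_dis / eta_dis)
    return soc_next, p_ch, p_dis
```

Each power is bounded separately, which is slightly conservative when charging and discharging are requested simultaneously but guarantees that the updated SoC stays within its limits.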

IV-B Voltage Control

TABLE III: Safe RL Applications in Voltage Control

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [33] | Minimize transmission losses | Voltage and other system constraints | Ins/Hard | Projection layer (III-B) | Through an embedded safe policy projection, it is possible to smoothly and effectively limit the action space, thereby preventing any breach of physical constraints. |
| [40] | Minimize cost | Voltage constraint | Ins/Hard | Lyapunov stability (III-C) | Each NN controller is designed to satisfy certain Lipschitz constraints so that the voltage constraints are inherently met, guaranteeing that the system maintains exponential stability. |
| [91] | Minimize transmission loss | Voltage and power flow constraints | Ins/Hard | Finite iteration projection (III-B) | A finite iteration projection algorithm is proposed to guarantee hard constraints by converting a non-convex optimization problem into a finite iteration problem. |
| [52] | Minimize the cost of network loss and device switching | Voltage and power flow constraints | Cum/Hard | Safety layer (III-F) | A safety layer is added to the policy NN to enhance operational constraint satisfaction in both the initial exploration phase and the convergence phase. |
| [17] | Minimize total network energy loss | Voltage deviations | Cum/Soft | Primal-dual policy (III-A) | Each zone has a central control agent that embeds GCNs to improve the decision-making capability. The primal-dual method is used to rigorously satisfy voltage safety constraints. |
| [92] | Minimize active power loss | Voltage violations | Cum/Soft | Lagrangian relaxation (III-A) | A MACSAC RL algorithm is proposed to train control agents online, eliminating the need for accurate ADN models. |
| [47] | Active voltage control | SoC of BESSs | Ins/Hard | Physics-based shielding (III-E) | The physics-shielded MATD3 algorithm is proposed, capable of replacing dangerous actions with safe ones as the BESSs approach dangerous SoC. |
| [93] | Minimize the ADN power losses and control efforts | Voltage and power grid constraints | Ins/Hard | Safety layer (III-F) | A safety layer is integrated directly on top of the DDPG actor network to forecast changes in constrained states and prevent the violation of operational constraints in ADNs. |
| [94] | Minimize the network power loss | Nodal voltage constraint | Ins/Hard | Safety projection (III-B) | In the training stage, the safety projection is added to the combined policy to analytically solve an action correction formulation to achieve guaranteed 100% voltage security. |
| [25] | Minimize the cost of losses and the device switching | Voltage constraint | Ins/Soft | Lagrangian relaxation (III-A) | A safe off-policy DRL, Constrained SAC, is proposed to solve Volt-VAR control problems in a model-free manner. |
| [95] | Minimize the total control cost | Voltage constraint | Ins/Hard | Safety projection layer (III-B) | By leveraging the underlying grid information, a projection layer is designed to project the reactive power injection into a safe set of nodal voltage magnitudes. |
| [41] | Minimize the voltage deviation and control cost | Voltage constraint | Ins/Hard | Lyapunov function (III-C) | An explicitly constructed Lyapunov function is utilized to certify stability for all monotone policies without knowledge of the underlying model parameters. |
| [96] | Minimize the cost of electricity and BESSs maintenance | Voltage constraint and ADN constraints | Cum/Soft | SAC with safety module | A model-free DRL algorithm, integrated with a safety module, is proposed to minimize voltage violations and real power losses, with a design that guarantees no voltage violations occur during the online training. |
| [35] | Minimize the total operation costs | Physical constraints | Cum/Hard | CPO (III-B) | The voltage control problem is formulated as a CMDP and solved by TRPO and CPO to enable safe exploration. |
| [16] | Minimize voltage violations and network losses | Voltage bound constraints | Cum/Soft | Penalty function and RRL (III-H) | An adversarial RL algorithm has been developed to train an offline agent that is robust against model mismatches. |

Voltage control aims to keep voltage magnitudes across the power network close to their nominal values or within an acceptable range. For example, Fig. 9 shows the Volt/Var/Watt curves used for voltage control [97]. Instead of directly controlling the active and reactive power injections of smart inverters, some researchers have proposed resetting the Volt/Var/Watt curves to control the voltage profiles [98, 99]. Increasing penetration levels of RESs, such as the large-scale deployment of wind farms in transmission systems and the widespread installation of distributed PVs and EVs in distribution networks, have led to significant changes in power system behavior. Because distribution networks are typically radial or distributed in structure and connect a large number of intermittent and uncertain distributed RESs, voltage management has become more complex and challenging, often leading to voltage violations (below 0.95 p.u. or above 1.05 p.u.) [100, 101]. Many current studies on voltage regulation use physical model-based optimization or control methods that employ convex relaxation techniques, such as second-order cone programming, to simplify the AC-PF constraints so that the problem can be solved efficiently with conventional solvers [33, 25, 102]. The applications of safe RL to voltage control are detailed in Table III. Based on Table III, we take the smart inverters of DGs and BESSs as a prime example to summarize the voltage control problem associated with safe RL. The state, action, reward, and constraints of voltage control are as follows:

Figure 9: The left panel shows the Volt/Var/Watt curves; the right panel shows the feasible region of the inverter for any set of parameters $\bm{\beta}_i$, where the two blue regions correspond to the charging and discharging modes indicated by $\eta$. Note that for solar panels $\eta_t^{(i)}=0$; hence, the left region of the inverter is inactive.

IV-B1 Volt/Var Control with AC-PF Constraints

State

The state variables are given by PMU measurements, with sensors installed at the buses in $\mathcal{N}^{\text{PMU}}$, or by AMI measurements, with sensors installed at the buses in $\mathcal{N}^{\text{AMI}}$. The state variable $\bm{s}$ is thus defined by:

$$\bm{s}^{\text{PMU}} \triangleq \left((v_i)_{i\in\mathcal{N}^{\text{PMU}}},\ (i_i)_{i\in\mathcal{N}^{\text{PMU}}}\right) \tag{21a}$$
$$\bm{s}^{\text{AMI}} \triangleq \left((|v_i|^2)_{i\in\mathcal{N}^{\text{AMI}}},\ (|i_i|^2)_{i\in\mathcal{N}^{\text{AMI}}},\ (s_{ap,i})_{i\in\mathcal{N}^{\text{AMI}}}\right) \tag{21b}$$
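To make the difference between the two state definitions concrete, the sketch below assembles (21a) and (21b) from per-unit phasors. It assumes, purely for illustration, that the apparent power $s_{ap,i}$ can be computed as $|v_i i_i^*|$ from the same phasors; in practice AMI devices report magnitudes directly and carry no phase information. The function names and data are hypothetical.

```python
def pmu_state(v, i):
    """State (21a): complex voltage and current phasors at PMU buses."""
    return tuple(v), tuple(i)

def ami_state(v, i):
    """State (21b): squared magnitudes and apparent power at AMI buses."""
    v_sq = tuple(abs(vk) ** 2 for vk in v)
    i_sq = tuple(abs(ik) ** 2 for ik in i)
    # Assumption: apparent power magnitude derived as |v * conj(i)|.
    s_ap = tuple(abs(vk * ik.conjugate()) for vk, ik in zip(v, i))
    return v_sq, i_sq, s_ap

# Hypothetical per-unit phasors at two buses.
v_meas = [1.0 + 0.0j, 0.6 + 0.8j]
i_meas = [0.3 - 0.4j, 0.5 + 0.0j]
v_sq, i_sq, s_ap = ami_state(v_meas, i_meas)
```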

The system dynamics that depict the environment can be formulated as

$$\bm{s}^{\text{V}}_{t+1}\triangleq\bm{f}(\bm{s}^{\text{V}}_{t},\bm{a}^{\text{V}}_{t}) \tag{22}$$
Action

The control actions include regulating the DGs, BESSs, and other components.

$$\bm{a}^{\text{V}}_{t}\triangleq\left(\bm{p}^{\text{DG}}_{t},\bm{q}^{\text{DG}}_{t},\bm{p}^{\text{BESS}}_{t},\bm{p}^{\text{other}}_{t}\right) \tag{23}$$
Reward

The reward encourages the voltage magnitudes to stay close to the nominal value $v_{\text{ref}}$ (typically 1.0 p.u.):

$$R^{\text{V}}(\bm{s},\bm{a})=-\|\bm{v}_{t}-v_{\text{ref}}\| \tag{24}$$

Another reward design is a soft mechanism that penalizes only excursions outside the acceptable range $[\underline{v},\overline{v}]$, with $[\cdot]_{+}\triangleq\max(\cdot,0)$:

$$R^{\text{V}}(\bm{s},\bm{a})=-\sum_{i\in\mathcal{N}}\big([v_{i}-\overline{v}]_{+}+[\underline{v}-v_{i}]_{+}\big) \tag{25}$$
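The two reward designs (24) and (25) can be compared on a small example. The sketch below assumes the Euclidean norm in (24), which the text leaves unspecified, and uses the 0.95 and 1.05 p.u. limits quoted earlier as the acceptable range; the function names and the voltage values are hypothetical.

```python
import math

def reward_tracking(v, v_ref=1.0):
    """Reward (24): negative distance of the voltages from v_ref.

    Assumption: the norm in (24) is taken to be the Euclidean norm.
    """
    return -math.sqrt(sum((vi - v_ref) ** 2 for vi in v))

def reward_band(v, v_min=0.95, v_max=1.05):
    """Reward (25): negative sum of upper and lower limit violations."""
    return -sum(max(vi - v_max, 0.0) + max(v_min - vi, 0.0) for vi in v)

voltages = [1.00, 1.04, 1.08, 0.93]  # hypothetical magnitudes in p.u.
r_track = reward_tracking(voltages)
r_band = reward_band(voltages)
```

The tracking reward penalizes every deviation from the nominal value, including the in-range bus at 1.04 p.u., whereas the band reward is nonzero only for the two buses outside the limits.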
Constraint

The constraint for the active and reactive power injections of DGs is given by:

$$(\bm{p}^{\text{DG}})^{2}+(\bm{q}^{\text{DG}})^{2}\leq(\bar{\bm{s}}_{\text{ap}}^{\text{DG}})^{2} \tag{26}$$

However, [97] points out that the stability regions are more constrained than the region defined by (26). For simplicity, we omit the specific equations. Figure 9 illustrates the piecewise-linear constraints that keep the battery system's active and reactive power injections within the blue feasible region, while solar panel inverters operate only in the right region, since they can only inject active power, i.e., $p\geq 0$.
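As a simple illustration of how a projection-type safety technique can enforce the capacity constraint (26), the sketch below rescales an infeasible $(p, q)$ pair radially, which is the Euclidean projection onto a disk centered at the origin. It covers only (26), not the tighter piecewise-linear regions of [97], and it is not the projection layer of any specific reviewed paper. The function name and the numbers are hypothetical.

```python
import math

def project_to_capacity(p, q, s_max):
    """Euclidean projection of (p, q) onto the disk p^2 + q^2 <= s_max^2."""
    s = math.hypot(p, q)
    if s <= s_max:
        return p, q          # already feasible, leave unchanged
    scale = s_max / s        # shrink radially onto the boundary
    return p * scale, q * scale

# Hypothetical policy output that exceeds a 1.0 p.u. inverter rating.
p_safe, q_safe = project_to_capacity(0.8, 0.9, s_max=1.0)
```

Radial scaling preserves the power factor of the proposed action, so the corrected action stays as close as possible to what the policy requested.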

IV-B2 Volt/Var Control with LinDistFlow Constraints

The LinDistFlow linearized branch flow model is applied to a tree-structured distribution network. The system consists of a set of nodes $\mathcal{N}_{+0}=\{0,1,\cdots,N\}$ and an edge set $\mathcal{E}$. Node 0 is the substation, and $\mathcal{N}=\mathcal{N}_{+0}\setminus\{0\}$ denotes the set of nodes excluding the substation node. Each node $i\in\mathcal{N}$ is associated with an active power injection $p_{i}$ and a reactive power injection $q_{i}$. Let $v_{i}$ be the squared voltage magnitude, and let $\bm{p}$, $\bm{q}$, and $\bm{v}$ denote $\{p_{i}\}_{i\in\mathcal{N}}$, $\{q_{i}\}_{i\in\mathcal{N}}$, and $\{v_{i}\}_{i\in\mathcal{N}}$ stacked into vectors. The variables satisfy the following equations, $\forall i\in\mathcal{N}$,

$$p_{i}=-p_{ji}+\sum_{k:(i,k)\in\mathcal{E}}p_{ik} \tag{27a}$$
$$q_{i}=-q_{ji}+\sum_{k:(i,k)\in\mathcal{E}}q_{ik} \tag{27b}$$
$$v_{i}=v_{j}-2(r_{ij}p_{ji}+x_{ij}q_{ji}) \tag{27c}$$

where $j$ is the parent node of $i$ in the distribution network, and $p_{ji}$ and $q_{ji}$ are the active and reactive power flows from node $j$ to node $i$. Equation (27c) can be written in vector form:

$$\bm{v}=\mathbf{r}\bm{p}+\mathbf{x}\bm{q}+v_{0}\mathbf{1}=\mathbf{x}\bm{q}+\bm{v}_{\text{env}} \tag{28}$$

where $\bm{v}_{\text{env}}=\mathbf{r}\bm{p}+v_{0}\mathbf{1}$ represents the component that cannot be controlled; $\mathbf{r}=[2r_{ij}]^{N\times N}$ and $\mathbf{x}=[2x_{ij}]^{N\times N}$ are matrices defined from the line resistances $r_{ij}$ and reactances $x_{ij}$, respectively.
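The equivalence between the recursive form (27) and the vector form (28) can be checked numerically on a small feeder. The sketch below assumes the standard LinDistFlow construction, in which entry $(i,k)$ of $\mathbf{r}$ (resp. $\mathbf{x}$) is twice the sum of resistances (resp. reactances) over the lines shared by the paths from nodes $i$ and $k$ to the substation; this is our reading of the shorthand $[2r_{ij}]$. The 4-bus topology, all numerical values, and all function names are hypothetical.

```python
# Tree feeder: parent[i] is the parent of node i (node 0 is the substation).
# Hypothetical 4-bus example with branches 0-1, 1-2, and 1-3.
parent = {1: 0, 2: 1, 3: 1}
r = {1: 0.01, 2: 0.02, 3: 0.015}   # resistance of line (parent[i], i), p.u.
x = {1: 0.02, 2: 0.03, 3: 0.025}   # reactance of line (parent[i], i), p.u.
p = {1: -0.5, 2: 0.3, 3: -0.2}     # active power injections, p.u.
q = {1: -0.1, 2: 0.1, 3: 0.05}     # reactive power injections, p.u.
v0 = 1.0                           # squared substation voltage, p.u.

def path_to_root(i):
    """Nodes whose upstream line lies on the path from node i to node 0."""
    path = []
    while i != 0:
        path.append(i)
        i = parent[i]
    return path

def sweep(p, q):
    """Solve (27a)-(27c): backward sweep for flows, forward sweep for voltages."""
    nodes = sorted(parent, key=lambda i: len(path_to_root(i)))  # root-first
    p_up = dict.fromkeys(nodes, 0.0)  # p_{ji}: flow from parent j into node i
    q_up = dict.fromkeys(nodes, 0.0)
    for i in reversed(nodes):         # deepest nodes first
        kids = [k for k in nodes if parent[k] == i]
        p_up[i] = -p[i] + sum(p_up[k] for k in kids)   # rearranged (27a)
        q_up[i] = -q[i] + sum(q_up[k] for k in kids)   # rearranged (27b)
    v = {0: v0}
    for i in nodes:                   # parents before children
        v[i] = v[parent[i]] - 2 * (r[i] * p_up[i] + x[i] * q_up[i])  # (27c)
    return [v[i] for i in sorted(parent)]

def sensitivity(z):
    """Entry (i, k): twice the sum of z over lines common to both root paths."""
    ids = sorted(parent)
    return [[2 * sum(z[m] for m in set(path_to_root(i)) & set(path_to_root(k)))
             for k in ids] for i in ids]

def vector_form(p, q):
    """Evaluate (28): v = r p + x q + v0 1."""
    ids = sorted(parent)
    R, X = sensitivity(r), sensitivity(x)
    return [v0 + sum(R[a][b] * p[k] + X[a][b] * q[k] for b, k in enumerate(ids))
            for a in range(len(ids))]

v_sweep = sweep(p, q)
v_matrix = vector_form(p, q)
```

Both routines return the squared voltage magnitudes at nodes 1 to 3, and they agree to floating-point precision, which illustrates why the voltage is an affine function of the reactive power injections under this model.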

State

As in the AC-PF case (21), the state under LinDistFlow is determined by PMU and AMI measurements.

Action

The control action is a mapping from the voltage to the reactive power, defined by:

$$\bm{a}^{\text{V}}_{t}=\Delta\bm{q}_{t}\triangleq\bm{q}_{t}-\bm{q}_{t+1} \tag{29}$$

The system dynamics can be given as

$$\bm{v}_{t+1}=\mathbf{r}\bm{p}+\mathbf{x}(\bm{q}_{t}-\bm{a}^{\text{V}}_{t})+v_{0}\mathbf{1} \tag{30}$$

where $\bm{p}$ has no time subscript because the controller acts on a fast timescale, over which the active power injection is assumed to be constant.

Reward

The reward is also designed to keep the voltage close to its nominal value (24) or within its maximum and minimum limits (25).

Constraint

The constraints include maximum and minimum value limits and the stability of the action:

$$\underline{\bm{a}}^{\text{V}}\leq\bm{a}^{\text{V}}_{t}\leq\overline{\bm{a}}^{\text{V}} \tag{31a}$$
$$\bm{a}^{\text{V}}_{t}~\text{is stabilizing} \tag{31b}$$
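One step of the LinDistFlow environment, combining the action definition (29), the dynamics (30), and the box constraint (31a), can be sketched as follows. The stability requirement (31b) is not addressed here, since it depends on the controller structure rather than on a single step. The 2-bus data, the action bounds, and the function names are hypothetical.

```python
def clip(a, lo, hi):
    """Enforce the box constraint (31a) element-wise."""
    return [min(max(ai, l), h) for ai, l, h in zip(a, lo, hi)]

def voltage_step(q_t, a_t, v_env, x_mat):
    """Apply (29)-(30): q_{t+1} = q_t - a_t and v_{t+1} = x q_{t+1} + v_env."""
    q_next = [qi - ai for qi, ai in zip(q_t, a_t)]
    v_next = [ve + sum(xij * qj for xij, qj in zip(row, q_next))
              for ve, row in zip(v_env, x_mat)]
    return q_next, v_next

# Hypothetical 2-bus data in p.u.; v denotes the squared voltage magnitude.
x_mat = [[0.04, 0.04], [0.04, 0.10]]
v_env = [1.06, 1.12]              # uncontrollable part r p + v0 1
q_t = [0.0, 0.0]
a_raw = [0.2, 0.9]                # action proposed by a policy
a_t = clip(a_raw, [-0.5, -0.5], [0.5, 0.5])
q_next, v_next = voltage_step(q_t, a_t, v_env, x_mat)
```

With the sign convention of (29), a positive action reduces the reactive power injection and therefore lowers the voltages, which is the desired response to the overvoltage in this example.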

IV-B3 Safe RL for Voltage Control

In recent years, the integration of DERs such as rooftop solar panels and EVs has led to rapid and unpredictable fluctuations in the generation and load profiles of distribution systems. These fluctuations pose significant challenges for real-time voltage control in distribution grids. RL has emerged as a powerful approach for addressing model-free nonlinear control problems, generating considerable interest in developing RL-based controllers to optimize the transient performance of voltage control. Safe RL has been effectively implemented to ensure adherence to voltage and transient stability constraints.

Looking ahead, the focus is shifting toward distributed voltage regulation, driven by the limitations of centralized voltage regulation, which requires a central controller and is susceptible to single-point failures and heavy communication burdens. Distributed voltage regulation, which only requires the exchange of local information with neighboring units, has therefore attracted considerable research interest as a promising direction for future development [17].

IV-C Stability Control

TABLE IV: Safe RL Applications in Stability Control

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [56] | Preventive control for transmission overload relief | Safety, generation, and network constraints | Cum/Hard | IPO (III-G) | The IPO method's efficacy is boosted by leveraging spatial-temporal correlations in power grid nodal and edge features. |
| [57] | Emergency control for under voltage load-shedding | Transient voltage stability | Cum/Hard | Barrier function (III-G) | The safe RL method employs a reward function with a time-dependent barrier function that approaches negative infinity as the system state nears the safety bounds. |
| [103] | Emergency load-shedding control | Rated capacity, current, voltage and others | Cum/Soft | Lagrangian relaxation (III-A) | Two DRL strategies are designed to tackle intricate power system control challenges in a data-driven manner, aiming to preserve power system stability. |
| [104] | Transient and steady-state voltage control | Reactive power capacity constraints | Ins/Hard | Lagrangian relaxation (III-A) and barrier function (III-G) | Based on the safe gradient flow framework, the design employs a control barrier function to ensure that given dynamics never leave a safe set. |
| [105] | Frequency control | Operational constraints | Cum/Soft | Safety model (III-F) | A safety model is proposed comprising two parts: one to check if actions meet safety standards, and another to suggest new actions if they don't. |
| [106] | Minimize the control cost | Frequency limit | Cum/Hard | Barrier function (III-G) | A novel self-tuning control barrier function is designed to actively compensate the unsafe frequency control strategies under variational safety constraints. |
| [107] | Primary frequency control | Frequency constraint | Ins/Hard | Gauge map (III-F) | A closed-form gauge map is proposed, which maps NN outputs from unsafe actions to the set of safe actions. |
| [108] | Frequency control | Operational safety constraints | Cum/Soft | Lagrangian relaxation (III-A) | Safety is considered during the action search process to ensure that various operational constraints are satisfied while the agent interacts with the environment. |
| [39] | Primary frequency control | Frequency stability constraints | Cum/Hard | Lyapunov method (III-C) | A Lyapunov function is integrated in the structural properties of controllers, guaranteeing local asymptotic stability. An RNN-based framework that incorporates frequency state transition dynamics is used to train controllers. |
| [109] | Wide-area damping control | System constraints | Ins/Hard | Bounded exploratory control | The agent uses DNN and DRL to identify and track the dynamics of the system and automatically takes actions to stabilize the system. |
| [110] | Minimize large frequency oscillations | Mean-variance risk measure | Cum/Soft | Lagrangian relaxation (III-A) | The risk-constrained linear quadratic regulator problem is addressed through dual reformulation into a minimax problem, utilizing a RL method. |
| [111] | FACTS setpoint control | Physical constraints | Cum/Soft | Lagrangian relaxation (III-A) | Model-based methods may underperform when faced with topology errors. RL improves by interacting with the environment, bypassing the need for updating network parameters. |

Power system stability control focuses on decision-making to prevent the system from entering undesired situations, especially to avert large catastrophic faults. Considering the sequence of control actions and contingencies, stability control is divided into two main categories: preventive and emergency control. Preventive security control aims to prepare the system while it is still in normal operation, ensuring it can satisfactorily handle future contingencies. In contrast, emergency control is initiated after contingencies have already occurred, with the objective of controlling the system's dynamics to minimize consequences [112]. Both preventive and emergency control impose strict timing requirements; emergency control is the more time-sensitive of the two, often requiring actions within tens of milliseconds.

From the perspective of key system variables that can indicate unstable behavior, traditional power system stability issues are classified into rotor angle stability, frequency stability, and voltage stability [113]. Considering the extensive integration of power electronic devices, power system stability issues have further expanded to include resonance stability and converter-driven stability [114]. Due to the complexity of stability issues and the rapidly changing system states, traditional analytical methods may struggle to find solutions and face computational efficiency limitations. However, RL and safe RL can efficiently address these challenges. The details of the applications of safe RL in stability control are shown in Table IV.

IV-C1 Frequency control

State

The state consists of the frequency $\omega$ and the rotor angle $\delta$:

$$\bm{s}^{\text{F}}\triangleq\left(\bm{\omega}_{t},\bm{\delta}_{t}\right) \tag{32}$$
Action

The control actions $\bm{a}_{t}$ are implemented through the control of active power injections:

$$\bm{a}^{\text{F}}\triangleq\left(\bm{p}^{\text{SG}}_{t},\bm{p}^{\text{RES}}_{t},\bm{p}^{\text{Load}}_{t}\right) \tag{33}$$
Reward

The reward penalizes the frequency deviation and the control action cost:

$$R^{\text{F}}(\bm{s},\bm{a})=-\sum_{i\in\mathcal{N}}\left(\|\Delta\omega_{i}\|_{\infty}+\lambda h_{i}(u_{i})\right) \tag{34}$$

where $\|\Delta\omega_{i}\|_{\infty}$ represents the maximum frequency deviation over the time horizon; the cost function $h_{i}(u_{i})$ of the control action $u_{i}$ is Lipschitz-continuous; and the cost coefficient $\lambda$ balances the cost of actions against the frequency deviations.
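The reward (34) can be evaluated directly from simulated trajectories. In the sketch below, the cost function is assumed to be $h_i(u_i)=|u_i|$, which is one simple Lipschitz-continuous choice and not necessarily the one used in the reviewed works; the trajectories, the value of $\lambda$, and the function name are hypothetical.

```python
def frequency_reward(delta_omega, u, lam=0.1, cost=abs):
    """Reward (34).

    delta_omega[i] is the frequency-deviation trajectory of bus i over the
    horizon, and u[i] is the control effort at bus i.
    Assumption: the default cost h_i(u_i) = |u_i| is illustrative only.
    """
    return -sum(max(abs(w) for w in traj) + lam * cost(ui)
                for traj, ui in zip(delta_omega, u))

# Hypothetical two-bus trajectories (p.u.) and control efforts.
delta_omega = [[0.0, -0.2, 0.1], [0.0, 0.05, -0.03]]
u = [0.5, -1.0]
r_freq = frequency_reward(delta_omega, u, lam=0.1)
```

Because the infinity norm only records the worst excursion, this reward targets the frequency nadir rather than the average deviation.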

Constraint

The system frequency dynamics are given by the swing equation:

$$\dot{\delta}_{i}=\omega_{i} \tag{35a}$$
$$M_{i}\dot{\omega}_{i}=p^{\text{Bus}}_{i}-D_{i}\Delta\omega_{i}-a^{\text{F}}_{i}(\omega_{i})-\sum_{j=1}^{n}B_{ij}\sin(\Delta\delta) \tag{35b}$$

where $\dot{\delta}$ and $\dot{\omega}$ represent the time derivatives $d\delta/dt$ and $d\omega/dt$, respectively; $\sum_{j=1}^{n}B_{ij}\sin(\Delta\delta)$ denotes the electrical power $p_{e,i}$ at each node $i$; the mechanical power $p_{m,i}$ is expressed as $p^{\text{Gen}}_{i}-\frac{\omega_{i}}{R_{i}}$; and the bus power injection $p^{\text{Bus}}_{i}$ is defined as $p^{\text{Gen}}_{i}-p^{\text{Load}}_{i}$. Other constraints are:

$$|p_{ij}|\leq\overline{p}_{ij},\qquad\underline{\bm{a}^{\text{F}}}\leq\bm{a}^{\text{F}}(\bm{\omega})\leq\overline{\bm{a}^{\text{F}}} \tag{36a}$$
$$\bm{a}^{\text{F}}(\bm{\omega})~\text{is stabilizing} \tag{36b}$$

where the requirement that $\bm{a}^{\text{F}}(\bm{\omega})$ must be stabilizing is defined using various methods, such as Lyapunov stability [39].
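To illustrate how the swing dynamics (35) interact with a bounded controller as in (36a), the sketch below integrates a two-bus system with forward Euler. Several assumptions are made for illustration only: $\omega_i$ is treated as the deviation from nominal frequency (so $\Delta\omega_i=\omega_i$), $\Delta\delta$ is read as $\delta_i-\delta_j$, and the controller is a saturated linear droop, which is a simple monotone map and not the learned controller of [39]. All parameter values and function names are hypothetical.

```python
import math

def droop(omega, k=2.0, a_max=0.5):
    """Saturated linear droop: monotone in omega and bounded as in (36a)."""
    return min(max(k * omega, -a_max), a_max)

def simulate(p_bus, M, D, B, steps=2000, dt=0.01):
    """Forward-Euler integration of the swing dynamics (35a)-(35b)."""
    n = len(p_bus)
    delta, omega = [0.0] * n, [0.0] * n
    for _ in range(steps):
        d_omega = []
        for i in range(n):
            # Electrical power leaving bus i over the network.
            p_e = sum(B[i][j] * math.sin(delta[i] - delta[j]) for j in range(n))
            d_omega.append(
                (p_bus[i] - D[i] * omega[i] - droop(omega[i]) - p_e) / M[i])
        # Explicit Euler: both updates use the values from the previous step.
        delta = [d + dt * w for d, w in zip(delta, omega)]
        omega = [w + dt * dw for w, dw in zip(omega, d_omega)]
    return delta, omega

# Hypothetical balanced two-bus system: bus 1 exports 0.1 p.u. to bus 2.
delta_T, omega_T = simulate(p_bus=[0.1, -0.1], M=[2.0, 2.0], D=[1.0, 1.0],
                            B=[[0.0, 1.0], [1.0, 0.0]])
```

After 20 s of simulated time the frequency deviations have decayed to nearly zero and the angle difference settles where the line flow matches the 0.1 p.u. transfer, i.e., $\sin(\delta_1-\delta_2)=0.1$.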

IV-D EV Charging Control

TABLE V: Safe RL Applications in EV Charging Control

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [115] | Minimize the EV charging cost | Constraints of action, entropy and SoC deviation | Cum/Soft | Lagrangian relaxation (III-A) | A model-free safe DRL algorithm is proposed to optimize real-time EV charging and discharging schedules without requiring accurate information on the arrival and departure times, remaining energy, and real-time electricity prices. |
| [116] | Energy management for plug-in hybrid EV | Physical constraints of components | Cum/Soft | Lagrangian relaxation (III-A) | By employing Lagrangian relaxation, the optimization for CMDP transforms into an unconstrained dual problem aimed at minimizing energy consumption. |
| [117] | Maximize the total profit | Limitations of power and demands | Cum/Soft | Lagrangian relaxation (III-A) | A detailed microgrid system is proposed, featuring a large CS, various EVs, V2G capabilities, and the non-linear charging behavior of EVs. |
| [118] | Maximize the revenue of electricity selling | EV charging constraint | Cum/Soft | Lagrangian relaxation (III-A) | The formulation takes into account the randomness of the EV's arrival time, departure time, and remaining energy, as well as the real-time electricity price. |
| [34] | Smooth out the load profile of a parking lot | Constraints of EV charging and bound | Ins/Hard | Penalty function and projection method (III-B) | Two penalty functions are designed: one to ensure the system charges the EV with sufficient energy, and the other to check if an action exceeds the upper bound of the action space. |
| [26] | Optimal EV charging control | Constraints of EV | Ins/Hard | Lagrangian relaxation (III-A) and projection (III-B) | The primary objective is to optimize the distribution of power within network boundaries by effectively managing power generation, EVs, and ESSs. |
| [119] | Minimize the vehicle energy consumption | Constraints of battery power bound | Ins/Hard | Shielding method (III-E) | The shield transforms the agent's desired action into a safe action for the environment. The desired action is only altered if it violates the safety rule embedded in the shield. |

The Paris Agreement recognizes EVs as a significant tool for reducing carbon emissions, which has spurred their rapid development worldwide. The number of EVs reached almost 30 million in 2022 and is expected to grow to about 240 million by 2030 in the stated policies scenario, an average annual growth rate of about 30%. Based on this trend, EVs will account for over 10% of the road vehicle fleet by 2030 [120]. However, the stochastic nature of EV charging can introduce unpredictable peak loads and voltage deviations in the power system. To address these issues, demand response for EVs has been proposed to mitigate grid peak loads and reduce charging costs. Optimizing charging is further complicated by the need to account for current electricity prices and the energy each EV requires. Additionally, some EVs operate in V2G mode, which lets them sell electricity back to the grid and adds another layer of complexity [121]. To handle the uncertainty associated with EVs, RL and safe RL methods offer promising ways to train effective charging strategies that achieve state-of-the-art performance [115]. Next, we describe the state, action, reward, and constraints of EV charging control.

State

The states include the SoC of EVs $\bm{SoC}^{\text{EV}}_{t}$, the amount of charge the EVs require $\bm{p}^{\text{EV}}_{d,t}$, the parking time of EVs $\bm{t}^{\text{EV}}_{p}$, the electricity price for charging from the grid to the EVs $\bm{\Lambda}^{\text{EV}}_{\text{ch},t}$, the electricity price for selling from the EVs to the grid $\bm{\Lambda}^{\text{EV}}_{\text{dis},t}$, the power generated by the RESs $\bm{p}^{\text{RESs}}_{t}$, and the demand of other loads $\bm{p}^{\text{Load}}_{t}$, which together determine the state of the grid [117, 115]:

\begin{equation}
\bm{s}^{\text{EV}}_{t}\triangleq\left(\bm{SoC}^{\text{EV}}_{t},\bm{p}^{\text{EV}}_{d,t},\bm{t}^{\text{EV}}_{p},\bm{\Lambda}^{\text{EV}}_{\text{ch},t},\bm{\Lambda}^{\text{EV}}_{\text{dis},t},\bm{p}^{\text{Load}}_{t}\right) \tag{37}
\end{equation}

Action

In existing research on EV charging management, the actions are primarily the charging power $\bm{p}^{\text{EV}}_{\text{ch},t}$ and discharging power $\bm{p}^{\text{EV}}_{\text{dis},t}$ [118, 117, 115]:

\begin{equation}
\bm{a}^{\text{EV}}_{t}\triangleq\left(\bm{p}^{\text{EV}}_{\text{ch},t},\bm{p}^{\text{EV}}_{\text{dis},t}\right) \tag{38}
\end{equation}

Reward

The reward includes minimizing the charging cost associated with the time-varying electricity prices, maximizing the revenue from selling electricity from EVs back to the grid, and aligning the SoC closely with the target value [115, 117]:

\begin{align}
R^{\text{EV}}(\bm{s},\bm{a}) &= -R^{\text{EV}}_{\text{cost}}+R^{\text{EV}}_{\text{rev}}-R^{\text{EV}}_{SoC} \tag{39a}\\
R^{\text{EV}}_{\text{cost}} &= \bm{\Lambda}^{\text{EV}}_{\text{ch},t}\bm{p}^{\text{EV}}_{\text{ch},t} \tag{39b}\\
R^{\text{EV}}_{\text{rev}} &= \bm{\Lambda}^{\text{EV}}_{\text{dis},t}\bm{p}^{\text{EV}}_{\text{dis},t} \tag{39c}\\
R^{\text{EV}}_{SoC} &= |\bm{SoC}^{\text{EV}}_{t}-\bm{SoC}^{\text{EV}}_{\text{target}}|, \tag{39d}
\end{align}

where (39b), (39c), and (39d) represent the electricity charging cost, the electricity selling revenue, and the deviation of the SoC from its target (a measure of EV charging satisfaction), respectively.
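As a minimal sketch of how the reward in (39a)-(39d) could be evaluated, the Python snippet below computes the per-step reward for a single EV. The scalar simplification (the paper's bold symbols are vectors over EVs), the `EVStep` container, and all field names are assumptions made for illustration and are not taken from the cited works.

```python
from dataclasses import dataclass


@dataclass
class EVStep:
    """Quantities for one EV at one time step (hypothetical container)."""
    price_ch: float    # Lambda^EV_{ch,t}: price for charging from the grid
    price_dis: float   # Lambda^EV_{dis,t}: price for selling to the grid
    p_ch: float        # p^EV_{ch,t}: charging power
    p_dis: float       # p^EV_{dis,t}: discharging power
    soc: float         # SoC^EV_t
    soc_target: float  # SoC^EV_target


def ev_reward(step: EVStep) -> float:
    """Per-step reward assembled term by term from (39a)-(39d)."""
    r_cost = step.price_ch * step.p_ch        # (39b) charging cost
    r_rev = step.price_dis * step.p_dis       # (39c) selling revenue
    r_soc = abs(step.soc - step.soc_target)   # (39d) SoC deviation
    return -r_cost + r_rev - r_soc            # (39a)


example = EVStep(price_ch=0.2, price_dis=0.1, p_ch=10.0, p_dis=0.0,
                 soc=0.5, soc_target=0.8)
print(ev_reward(example))
```

The three terms carry different units (money and SoC fraction), so an implementation would likely scale them with weights; (39a) as written uses unit weights, and the sketch follows it.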

Constraint

Generally, EVs act as controllable loads within the electrical grid, with specific requirements for charging. When considering the V2G mode, the modeling of EVs is similar to that of BESS [117]:

\begin{align}
&0\leq\bm{p}^{\text{EV}}_{\text{ch},t}\leq\overline{\bm{p}}^{\text{EV}}_{\text{ch}},\quad 0\leq\bm{p}^{\text{EV}}_{\text{dis},t}\leq\overline{\bm{p}}^{\text{EV}}_{\text{dis}} \tag{40a}\\
&\underline{\bm{SoC}}^{\text{EV}}\leq\bm{SoC}^{\text{EV}}_{t}\leq\overline{\bm{SoC}}^{\text{EV}} \tag{40b}\\
&\bm{SoC}^{\text{EV}}_{t}=\bm{SoC}^{\text{EV}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{EV}}}\Big(\eta^{\text{EV}}_{\text{ch}}\bm{p}^{\text{EV}}_{\text{ch},t}-\frac{\bm{p}^{\text{EV}}_{\text{dis},t}}{\eta^{\text{EV}}_{\text{dis}}}\Big) \tag{40c}
\end{align}

where (40a) and (40b) bound the charging and discharging power and the SoC of the EVs, respectively, and (40c) represents the SoC update process of EVs. In addition, most EVs must reach a target SoC by a specified time $t$:

\begin{equation}
\bm{SoC}^{\text{EV}}_{t}\geq\bm{SoC}^{\text{EV}}_{\text{target}} \tag{41}
\end{equation}
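The constraints (40a)-(40c) and (41) can be sketched as a transition function plus a feasibility check. The snippet below is an illustrative single-EV version; the function names, argument names, and the convention of returning equation labels for violated constraints are assumptions, not an interface from the reviewed papers.

```python
def soc_update(soc_prev, p_ch, p_dis, dt, e_cap, eta_ch, eta_dis):
    """SoC transition (40c): charging is scaled by eta_ch, discharging by 1/eta_dis."""
    return soc_prev + dt / e_cap * (eta_ch * p_ch - p_dis / eta_dis)


def violated_constraints(soc, p_ch, p_dis, p_ch_max, p_dis_max,
                         soc_min, soc_max, soc_target=None):
    """Return the labels of the constraints among (40a), (40b), (41) that fail."""
    violated = []
    if not (0.0 <= p_ch <= p_ch_max and 0.0 <= p_dis <= p_dis_max):
        violated.append("40a")  # power bounds
    if not (soc_min <= soc <= soc_max):
        violated.append("40b")  # SoC bounds
    if soc_target is not None and soc < soc_target:
        violated.append("41")   # target SoC, checked only when a target is given
    return violated


# One hour of charging at 7 kW into a 70 kWh battery (assumed example values).
soc = soc_update(soc_prev=0.5, p_ch=7.0, p_dis=0.0, dt=1.0,
                 e_cap=70.0, eta_ch=0.9, eta_dis=0.9)
print(soc, violated_constraints(soc, 7.0, 0.0, 11.0, 11.0, 0.1, 0.9, soc_target=0.8))
```

Passing `soc_target` only at the departure step mirrors the statement that the target applies at a specified time rather than throughout the parking period.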

A comprehensive review of the application of safe RL to EV charging control is provided in Table V. Most objectives in Table V focus on reducing EV charging costs, whereas [34] emphasizes peak shaving and valley filling to smooth the electric net-load profile. In terms of specific safe RL techniques, most papers employ methods based on Lagrangian relaxation [115, 116, 117, 118, 26]. The exceptions are [34], which uses penalty functions with projection, and [119], which adopts the shielding method.
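To make the contrast between the soft, cumulative handling (Lagrangian relaxation) and the hard, instantaneous handling (shielding) concrete, the sketch below shows one generic form of each for EV charging. It is a textbook-style illustration under stated assumptions, not the algorithm of any cited paper: the dual-ascent step, the charging-only shield, and every function and parameter name are hypothetical.

```python
def lagrangian_reward(reward, cost, lam):
    """Soft handling: fold the constraint cost into the reward with multiplier lam."""
    return reward - lam * cost


def update_multiplier(lam, mean_episode_cost, cost_budget, step_size):
    """Projected dual ascent: raise lam when the cost budget is exceeded, keep lam >= 0."""
    return max(0.0, lam + step_size * (mean_episode_cost - cost_budget))


def shield_charging_power(p_ch_desired, soc, soc_max, p_ch_max, dt, e_cap, eta_ch):
    """Hard handling: clip the desired charging power so the next SoC stays <= soc_max.

    The desired action is returned unchanged whenever it is already safe.
    """
    # Largest charging power that keeps SoC within soc_max, from the SoC update (40c).
    p_soc_limit = (soc_max - soc) * e_cap / (dt * eta_ch)
    p_safe_max = max(0.0, min(p_ch_max, p_soc_limit))
    return min(max(p_ch_desired, 0.0), p_safe_max)


lam = update_multiplier(lam=0.5, mean_episode_cost=3.0, cost_budget=1.0, step_size=0.1)
print(lam, shield_charging_power(7.0, soc=0.95, soc_max=1.0, p_ch_max=11.0,
                                 dt=1.0, e_cap=50.0, eta_ch=1.0))
```

The multiplier only shapes the expected cumulative cost during training, so individual steps may still violate a limit, whereas the shield overrides each unsafe action before it reaches the environment; this matches the Cum/Soft and Ins/Hard labels used in Table V.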

IV-E Building Energy Management

TABLE VI: Safe RL applications in building energy management

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [122] | Tropical air free-cooled data center control | Constraints of temperature and humidity | Cum/Soft | Lagrangian relaxation (III-A) | By controlling the supply and exhaust fans, the cooling coil, and the dampers, the temperature and relative humidity of the air supplied to the servers are maintained below thresholds. |
| [123] | Dynamic thermal management in data center buildings | Constraints of equipment temperature | Cum+Ins/Hard+Soft | Lagrangian relaxation (III-A) and shielding (III-E) | Lagrangian-based constrained DRL and reward shaping are used to minimize soft violations. Parameterized shielding is employed to effectively avoid extreme temperature violations. |
| [48] | Data center building cooling | Constraints of zone temperature | Ins/Hard | Shielding method (III-E) | Shielding is avoided during training to not impede full exploration. An approach integrating empirical thermodynamics knowledge with data-driven models is proposed. |
| [124] | Multi-energy management of smart home | Constraints of components in the smart home | Cum/Soft | PDO (III-A) | By employing PDO, the Lagrangian relaxation coefficients for cost functions are automatically adjusted during the training, thereby minimizing both energy bills and the constraint costs. |
| [125] | District cooling system control | Power constraint | Ins/Hard | Safety layer (III-F) | A model-free DRL method is proposed that operates without needing an accurate system model or uncertainty distribution, utilizing a self-adaptive reward function to limit peak power. |
| [126] | Energy savings in building energy systems | Constraints of indoor temperature demand | Ins/Hard | Shielding method (III-E) | Implicit and explicit safety policies are combined through online residual learning, enabling real-time safety by filtering out unsafe actions, overcoming the limitations of relying solely on penalty-based rewards. |
| [127] | Safe building HVAC control | Constraints of building | Cum/Soft | Safety-aware objective | To ensure safe exploration, Gaussian noise is added to a hand-crafted rule-based controller. Adjusting the noise's variance helps balance the diversity and safety. |
| [128] | Resilient proactive scheduling of building | Constraints of components of building | Cum/Soft | Adaptive reward | Conditional value-at-risk is used to handle uncertainties from extreme weather events, significantly reducing their impact on the learning process and achieving a balanced approach between exploration and exploitation. |
| [129] | Real-time control in a smart energy-hub | Physical constraints of energy hub | Cum/Soft | Safety-guided function | A safety-guided function calculates the action-value function based on accumulated safety, determining the trajectory's safety under the current policy projected into the future. |
| [130] | Optimal dispatch of an energy hub | Constraints of energy balance and equipment | Cum/Soft | Primal-dual method (III-A) | The approach blends imitation learning for lower costs and primal-dual optimization to meet constraints, working better than using either method alone. |

In 2022, the global buildings sector was a major energy consumer, accounting for 30% of final energy demand, primarily for operational needs such as heating and cooling [131]. Energy hubs, connected to both the electric grid and the natural gas network, serve three types of energy demand (electrical, heating, and cooling) by controlling RESs, ESSs, EHPs, GBs, and HVAC systems [130]. Effective control of cooling or HVAC systems in buildings and energy hubs is therefore necessary. Traditional cooling control relies on feedback control, whereas RL can self-learn and adapt in uncertain and complex environments, which has led to its wide adoption in recent years. Building energy management aims to minimize energy consumption while satisfying the constraints of thermal-related equipment such as HVAC, EHP, and GB, the demands for electricity and heat, and environmental constraints such as temperature and humidity, as detailed in Table VI.

This review highlights models that demonstrate the integration of HVAC systems with power systems, particularly through safe RL controls. We explore the state, action, reward, and constraints associated with the RL control of HVAC and power systems, providing specific examples within the context of energy management in HVAC as follows:

State

The state of the building, in relation to HVAC systems, includes the indoor and outdoor temperatures $T^{I/O}$, humidity $H$, actual airflow rate $\bm{s}^{\text{air}}$, and actual ventilation rate $\bm{s}^{\text{ven}}$ [132]. It also covers the BESS SoC $\bm{SoC}^{\text{BESS}}$, TESS SoC $\bm{SoC}^{\text{TESS}}$, CHP state $\bm{s}^{\text{CHP}}$, GB state $\bm{s}^{\text{GB}}$, and EHP state $\bm{s}^{\text{EHP}}$; the state of core operational equipment, such as the IT equipment temperature $T^{\text{IT}}$; human satisfaction indicators $\bm{s}^{\text{Human}}$, such as a thermal comfort index; and exogenous states, such as the grid electricity price $\bm{\Lambda}^{\text{Ele}}$, gas price $\bm{\Lambda}^{\text{Gas}}$, and carbon price $\bm{\Lambda}^{\text{Car}}$ [126, 132, 129]:

\begin{equation}
\begin{aligned}
\bm{s}^{\text{Building}}_{t}\triangleq\big(&T^{I},T^{O},H,\bm{s}^{\text{air}},\bm{s}^{\text{ven}},\bm{SoC}^{\text{BESS}},\bm{SoC}^{\text{TESS}},\\
&\bm{s}^{\text{CHP}},\bm{s}^{\text{GB}},\bm{s}^{\text{EHP}},T^{\text{IT}},\bm{s}^{\text{Human}},\bm{\Lambda}^{\text{Ele}},\bm{\Lambda}^{\text{Gas}},\bm{\Lambda}^{\text{Car}}\big)
\end{aligned}
\tag{42}
\end{equation}

Action

Building energy management for HVAC is primarily achieved by controlling energy equipment. The actions include the temperature setpoint $T_{\text{set}}$, humidity setpoint $H_{\text{set}}$, airflow rate $\bm{a}^{\text{air}}$, ventilation rate $\bm{a}^{\text{ven}}$, BESS charge or discharge amount $\bm{p}^{\text{BESS}}_{\text{ch}/\text{dis}}$, TESS charge or discharge amount $\bm{h}^{\text{TESS}}_{\text{ch}/\text{dis}}$, electricity generated by CHP $\bm{p}^{\text{CHP}}$, heat generated by CHP $\bm{h}^{\text{CHP}}$, GB $\bm{h}^{\text{GB}}$, and EHP $\bm{h}^{\text{EHP}}$, and RES output $\bm{p}^{\text{RES}}$ [127]:

\begin{equation}
\begin{aligned}
\bm{a}^{\text{Building}}_{t}\triangleq\big(&T_{\text{set}},H_{\text{set}},\bm{a}^{\text{air}},\bm{a}^{\text{ven}},\bm{p}^{\text{BESS}}_{\text{ch}/\text{dis}},\\
&\bm{h}^{\text{TESS}}_{\text{ch}/\text{dis}},\bm{p}^{\text{CHP}},\bm{h}^{\text{CHP}},\bm{h}^{\text{GB}},\bm{h}^{\text{EHP}},\bm{p}^{\text{RES}}\big)
\end{aligned}
\tag{43}
\end{equation}

Reward

The reward is designed to minimize the total energy cost, including the cost of electricity, natural gas, and heat, as well as long-term device degradation, especially for BESSs and TESSs. For studies that require the room temperature to stay within a specific range, temperature deviations are often included in the reward calculation:

\begin{equation}
R^{\text{Building}}(\bm{s},\bm{a})=-(R_{\text{cost}}+R_{\text{degrade}}+\Delta T) \tag{44}
\end{equation}

where the three components represent the energy cost, the device degradation cost, and the temperature deviation, respectively.
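A minimal sketch of (44) is given below. The text does not spell out how the temperature deviation is measured, so the snippet assumes it is the distance of the indoor temperature from a comfort band; that definition, the band limits, and all names are illustrative assumptions.

```python
def temperature_deviation(t_indoor, t_low, t_high):
    """Assumed Delta T: how far the indoor temperature lies outside [t_low, t_high]."""
    return max(t_low - t_indoor, 0.0) + max(t_indoor - t_high, 0.0)


def building_reward(energy_cost, degradation_cost, t_indoor, t_low, t_high):
    """Per-step reward (44): negative sum of cost, degradation, and deviation."""
    delta_t = temperature_deviation(t_indoor, t_low, t_high)
    return -(energy_cost + degradation_cost + delta_t)


# Indoor temperature 2 degrees above an assumed 20-25 degree comfort band.
print(building_reward(energy_cost=3.0, degradation_cost=0.5,
                      t_indoor=27.0, t_low=20.0, t_high=25.0))
```

With this definition the deviation term vanishes whenever the temperature stays inside the band, so the agent is only penalized for comfort violations and not for its position within the allowed range.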

Constraint

The generation and consumption of electrical and thermal energy must be equal at every time step, as expressed by the electrical and thermal balance equations [129, 124]:

\begin{align}
&\bm{p}^{\text{Grid}}_{t}+\bm{p}^{\text{RESs}}_{t}+\bm{p}^{\text{BESS}}_{\text{dis},t}+\bm{p}^{\text{CHP}}_{t}=\bm{p}^{\text{HVAC}}_{t}+\bm{p}^{\text{Load}}_{t}+\bm{p}^{\text{EV}}_{t}+\bm{p}^{\text{BESS}}_{\text{ch},t}+\bm{p}^{\text{EHP}}_{t} \tag{45a}\\
&\bm{h}^{\text{CHP}}_{t}+\bm{h}^{\text{GB}}_{t}+\bm{h}^{\text{TESS}}_{\text{dis},t}+\bm{h}^{\text{EHP}}_{t}=\bm{h}^{\text{TL}}_{t}+\bm{h}^{\text{TESS}}_{\text{ch},t} \tag{45b}
\end{align}
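One common way to use the balance equations (45a) and (45b) in a simulated environment is to treat the grid exchange as the slack variable and to monitor the thermal residual. The sketch below follows that convention; treating the grid as slack, reading $\bm{h}^{\text{TL}}_{t}$ as the thermal load, and the function and argument names are all assumptions made for illustration.

```python
def grid_import(p_res, p_bess_dis, p_chp, p_hvac, p_load, p_ev, p_bess_ch, p_ehp):
    """Grid power p^Grid_t that closes the electrical balance (45a).

    A negative value means the site would export power (if that is allowed).
    """
    demand = p_hvac + p_load + p_ev + p_bess_ch + p_ehp
    local_supply = p_res + p_bess_dis + p_chp
    return demand - local_supply


def thermal_residual(h_chp, h_gb, h_tess_dis, h_ehp, h_tl, h_tess_ch):
    """Left side minus right side of (45b); zero when heat supply meets demand."""
    return (h_chp + h_gb + h_tess_dis + h_ehp) - (h_tl + h_tess_ch)


print(grid_import(p_res=20.0, p_bess_dis=5.0, p_chp=10.0, p_hvac=15.0,
                  p_load=30.0, p_ev=7.0, p_bess_ch=0.0, p_ehp=8.0))
print(thermal_residual(h_chp=12.0, h_gb=10.0, h_tess_dis=0.0,
                       h_ehp=24.0, h_tl=40.0, h_tess_ch=6.0))
```

A nonzero thermal residual signals that the chosen actions leave the heat demand unmet or oversupplied, which is the kind of quantity a penalty or cost function in a safe RL formulation could be built on.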

The constraints of BESS have already been shown in (20). The constraints of TESS are similar to those of BESS:

\begin{align}
&\bm{0}\leq\bm{h}^{\text{TESS}}_{\text{ch},t}\leq\overline{\bm{h}}^{\text{TESS}}_{\text{ch}},\quad \bm{0}\leq\bm{h}^{\text{TESS}}_{\text{dis},t}\leq\overline{\bm{h}}^{\text{TESS}}_{\text{dis}} \tag{46a}\\
&\underline{\bm{SoC}}^{\text{TESS}}\leq\bm{SoC}^{\text{TESS}}_{t}\leq\overline{\bm{SoC}}^{\text{TESS}} \tag{46b}\\
&\bm{SoC}^{\text{TESS}}_{t}=\bm{SoC}^{\text{TESS}}_{t-1}+\frac{\Delta t}{E_{\text{cap}}^{\text{TESS}}}\Big(\eta^{\text{TESS}}_{\text{ch}}\bm{h}^{\text{TESS}}_{\text{ch},t}-\frac{\bm{h}^{\text{TESS}}_{\text{dis},t}}{\eta^{\text{TESS}}_{\text{dis}}}\Big) \tag{46c}
\end{align}
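As a worked illustration of the SoC update (46c), consider assumed values $\Delta t = 1$ h, $E_{\text{cap}}^{\text{TESS}} = 100$ kWh, $\eta^{\text{TESS}}_{\text{ch}} = \eta^{\text{TESS}}_{\text{dis}} = 0.9$, and an initial SoC of 0.40. Charging with 20 kW in the first step and then discharging 18 kW in the second step gives

\begin{align*}
\bm{SoC}^{\text{TESS}}_{1} &= 0.40+\frac{1}{100}\left(0.9\times 20-\frac{0}{0.9}\right)=0.58,\\
\bm{SoC}^{\text{TESS}}_{2} &= 0.58+\frac{1}{100}\left(0.9\times 0-\frac{18}{0.9}\right)=0.38.
\end{align*}

Only 18 kWh of the 20 kWh supplied is stored, and delivering 18 kWh drains 20 kWh from storage, so the SoC ends below its starting value. The efficiencies therefore penalize both directions of operation, which discourages needless cycling under (46b).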

CHP is a single-input-multi-output converter with high electrical and thermal energy efficiency, and its constraints are as follows [129]:

\begin{align}
&\bm{p}^{\text{CHP}}_{t}=\eta^{\text{CHP}}_{p}\bm{g}^{\text{CHP}}_{t},\quad \bm{h}^{\text{CHP}}_{t}=\eta^{\text{CHP}}_{h}\bm{g}^{\text{CHP}}_{t} \tag{47a}\\
&\bm{0}\leq\bm{p}^{\text{CHP}}_{t}\leq\overline{\bm{p}}^{\text{CHP}},\quad \bm{0}\leq\bm{h}^{\text{CHP}}_{t}\leq\overline{\bm{h}}^{\text{CHP}} \tag{47b}
\end{align}

where (47a) describes the conversion of the natural gas input $\bm{g}^{\text{CHP}}_{t}$ into electric power $\bm{p}^{\text{CHP}}_{t}$ and heat power $\bm{h}^{\text{CHP}}_{t}$ with efficiencies $\eta^{\text{CHP}}_{p}$ and $\eta^{\text{CHP}}_{h}$; (47b) represents the ranges of $\bm{p}^{\text{CHP}}_{t}$ and $\bm{h}^{\text{CHP}}_{t}$.
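Because both CHP outputs in (47a) are driven by the same gas input, the two bounds in (47b) jointly limit that input. The sketch below computes the outputs and the largest admissible gas input; the function names and the example efficiencies are illustrative assumptions.

```python
def chp_outputs(gas_in, eta_p, eta_h):
    """Electric and heat output of the CHP for a given gas input, following (47a)."""
    return eta_p * gas_in, eta_h * gas_in


def chp_max_gas(p_max, h_max, eta_p, eta_h):
    """Largest gas input for which both outputs respect the upper bounds in (47b)."""
    return min(p_max / eta_p, h_max / eta_h)


g_max = chp_max_gas(p_max=40.0, h_max=45.0, eta_p=0.4, eta_h=0.5)
print(g_max, chp_outputs(g_max, eta_p=0.4, eta_h=0.5))
```

In this example the heat bound is the binding one, so the electric output cannot reach its own upper limit; an agent that selects $\bm{p}^{\text{CHP}}_{t}$ and $\bm{h}^{\text{CHP}}_{t}$ independently could therefore propose infeasible pairs unless the coupling in (47a) is enforced.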

GB and EHP respectively convert natural gas and electricity into heat to meet the heating demand, which can be represented as follows [130]:

$$\bm{h}^{\text{GB}}_{h}=\eta^{\text{GB}}\bm{g}^{\text{GB}}_{t} \qquad \bm{h}^{\text{EHP}}_{t}=\eta^{\text{EHP}}\bm{p}^{\text{EHP}}_{t} \tag{48a}$$
$$\bm{0}\leq\bm{h}^{\text{GB}}_{h}\leq\overline{\bm{h}}^{\text{GB}} \qquad \bm{0}\leq\bm{h}^{\text{EHP}}_{t}\leq\overline{\bm{h}}^{\text{EHP}} \tag{48b}$$

where (48a) indicates the conversion of natural gas and electricity to heat with different efficiencies; (48b) is the range of $\bm{h}^{\text{GB}}_{h}$ and $\bm{h}^{\text{EHP}}_{t}$.
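The linear conversion relations (47a) and (48a), together with the box limits (47b) and (48b), can be sketched in a few lines. This is only an illustrative sketch: the function names, the `ConverterLimits` container, and all numerical values are assumptions made here for the example and do not come from the reviewed papers.

```python
from dataclasses import dataclass


@dataclass
class ConverterLimits:
    """Hypothetical upper output limits, all expressed in the same power unit."""
    p_chp_max: float  # upper bound on CHP electric output
    h_chp_max: float  # upper bound on CHP heat output
    h_gb_max: float   # upper bound on gas boiler heat output
    h_ehp_max: float  # upper bound on electric heat pump heat output


def convert(g_chp, g_gb, p_ehp, eta_p_chp, eta_h_chp, eta_gb, eta_ehp):
    """Apply the linear efficiency relations (47a) and (48a)."""
    return {
        "p_chp": eta_p_chp * g_chp,  # gas -> electricity in the CHP unit
        "h_chp": eta_h_chp * g_chp,  # gas -> heat in the CHP unit
        "h_gb": eta_gb * g_gb,       # gas -> heat in the gas boiler
        "h_ehp": eta_ehp * p_ehp,    # electricity -> heat in the heat pump
    }


def within_limits(outputs, limits, tol=1e-9):
    """Check the box constraints (47b) and (48b): 0 <= output <= upper bound."""
    upper = {
        "p_chp": limits.p_chp_max,
        "h_chp": limits.h_chp_max,
        "h_gb": limits.h_gb_max,
        "h_ehp": limits.h_ehp_max,
    }
    return all(-tol <= outputs[name] <= upper[name] + tol for name in upper)
```

In a safe RL environment, a check such as `within_limits` could feed a penalty term or a safety layer; which of the two is used depends on the method, as summarized in Section III.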

HVAC is an important tool for monitoring and controlling the indoor temperature to keep it within the required range [124, 128]:

$$T^{I}_{t}=\epsilon T^{I}_{t-1}+(1-\epsilon)\left(T^{O}_{t-1}-\frac{\eta^{\text{HVAC}}E^{\text{HVAC}}_{t-1}}{A}\right) \tag{49a}$$
$$\underline{E}^{\text{HVAC}}\leq E^{\text{HVAC}}_{t}\leq\overline{E}^{\text{HVAC}} \qquad \underline{T}^{I}\leq T^{I}_{t}\leq\overline{T}^{I} \tag{49b}$$

where $E^{\text{HVAC}}$ denotes the energy consumption of HVAC; (49a) indicates the temperature change of the room; (49b) represents the limits of HVAC energy consumption $E^{\text{HVAC}}_{t}$ and indoor temperature $T^{I}_{t}$.
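To make the thermal dynamics concrete, the following sketch rolls (49a) forward in time and records the steps at which the limits in (49b) are violated. The function names and every parameter value used with them are hypothetical; the sign convention follows (49a), in which HVAC energy lowers the indoor temperature.

```python
def step_indoor_temperature(t_in, t_out, e_hvac, eps, eta, a):
    """One step of (49a): T_t = eps*T_{t-1} + (1-eps)*(T^O_{t-1} - eta*E_{t-1}/A)."""
    return eps * t_in + (1.0 - eps) * (t_out - eta * e_hvac / a)


def rollout(t0, outdoor, energy, eps, eta, a, e_min, e_max, t_min, t_max):
    """Simulate the room temperature and list the steps that violate (49b)."""
    temps = [t0]
    violations = []
    for k, (t_out, e) in enumerate(zip(outdoor, energy)):
        energy_ok = e_min <= e <= e_max            # HVAC energy limit
        t_next = step_indoor_temperature(temps[-1], t_out, e, eps, eta, a)
        temperature_ok = t_min <= t_next <= t_max  # comfort band
        if not (energy_ok and temperature_ok):
            violations.append(k)
        temps.append(t_next)
    return temps, violations
```

A safe RL agent would choose the `energy` sequence; the list of violating steps is the kind of signal that a cumulative cost (soft constraint) or an instantaneous action correction (hard constraint) would be built on.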

IV-F Other Control Areas

TABLE VII: Safe RL Applications in Other Control Areas

| Research | Problem/Objective | Constraint | Cum/Ins, Hard/Soft | Safety Constraint Techniques | Key Features |
|---|---|---|---|---|---|
| [133] | Optimal scheduling of EV aggregators | Constraints of EVs and driver’s energy demand | Cum/Soft | Lagrangian relaxation (III-A) | An L2 norm penalty term is added to form an augmented Lagrangian function, which enhances the convexity and tractability of the CMDP. |
| [134] | V2G market | Constraints of maximum incentive | Cum/Soft | Primal-dual theories (III-A) | This is the first model-free learning algorithm designed to optimize incentives without knowing how EV users will react. It simultaneously improves load control and user satisfaction. |
| [135] | Pricing strategy for real-time congestion management | Constraints of CS, operator, and grid | Cum/Soft | Adaptive constraint cost | An adaptive scalability factor is introduced to balance safety and exploration. Then, a constrained cross-entropy method is employed to solve this pricing problem within a continuous action space. |
| [53] | Service restoration | Power flow and voltage constraints | Cum+Ins/Hard | Safety layer (III-F) and penalty term | Imitation learning is utilized to ensure acceptable initial performance. Action clipping, reward shaping, and expert demonstrations are employed to guarantee safe exploration. |
| [136] | Critical load restoration | Constraints of loads, DERs, ESSs | Cum/Soft | Primal-dual differentiable programming (III-A) | Compared to the traditional RL that uses arbitrarily large unit penalties, the proposed method can achieve better performance, evidenced by a higher objective value. |
| [49] | Unit commitment | Constraints of scheduling | Ins/Hard | Clipping (III-E) | Clipping of the action space is performed to ensure that uncertainty estimates are reasonable and within appropriate bounds, which are derived from historical data. |
| [137] | Reserve scheduling | Constraints of voltage, RESs, tie line, and ESSs | Cum/Soft | Primal-dual method (III-A) | ESS is fully utilized through more accurate intraday operation scenario simulations to enhance the system’s peak management and flexibility, reducing the reserve requirements of the main network. |

In this section, the applications of safe RL in the electricity market, system restoration, and unit commitment and reserve scheduling are summarized, as detailed in Table VII. The specific state, action, reward, and constraints for each area are presented as follows:

IV-F1 Electricity Market

Electricity markets can promote the participation of users in the grid through dynamic pricing and incentive measures to balance supply and demand, thereby enhancing overall energy efficiency. [135] employs safe RL to formulate dynamic pricing strategies for controlling shiftable loads such as EVs and HVAC systems. While some have used NNs to predict the optimal marginal prices of the OPF, such as in [138], these approaches do not derive a stochastic policy. Although EVs are still involved in this section, we mainly focus on aspects related to pricing and DSO operational costs, whereas Section IV-D primarily addresses the OPF that includes EVs. The state, action, reward, and constraints of electricity markets are shown as follows [133, 135, 134]:

State

The state is the observed status information of the CSs and the DSO, including the total cost of the EV CSs $\bm{s}_{\text{cost}}^{\text{CS}}$ and the total cost of the DSO $\bm{s}_{\text{cost}}^{\text{DSO}}$.

$$\bm{s}^{\text{Market}}_{t}\triangleq\left(\bm{s}_{\text{cost}}^{\text{CS}},\bm{s}_{\text{cost}}^{\text{DSO}}\right) \tag{50}$$
Action

The action denotes the incentive electricity price of different EV CSs $\bm{\Lambda}^{\text{CS}}$.

$$\bm{a}^{\text{Market}}_{t}\triangleq\left(\bm{\Lambda}^{\text{CS}}\right) \tag{51}$$
Reward

The reward is to minimize the cost of EV users and maximize the profits of CSs and DSOs by setting different electricity prices.

$$R^{\text{Market}}(\bm{s},\bm{a})=-R^{\text{User}}+R^{\text{CS}}+R^{\text{DSO}} \tag{52}$$
Constraint

The constraints follow the EV model presented in Section IV-D.

IV-F2 System Restoration

System restoration refers to the process of swiftly recovering load from an impacted state to normal operation following the occurrence of extreme events. [53, 136] generate system restoration strategies through the use of safe RL, either by controlling local DERs or by transferring load to safe areas. The state, action, reward, and constraints of system restoration are shown as follows:

State

The state includes the forecast of future renewable energy output $\bm{p}_{t+1}^{\text{RES}}$, past restored loads $\bm{p}_{t-1}^{\text{Load}}$, the current SoC of the BESSs $\bm{SoC}_{t}^{\text{BESS}}$, and the remaining reserves of the various types of generators $\overline{\bm{p}_{t}}^{\text{Gen}}-\bm{p}_{t}^{\text{Gen}}$.

$$\bm{s}^{\text{Restoration}}_{t}\triangleq\left(\bm{p}_{t+1}^{\text{RES}},\bm{p}_{t-1}^{\text{Load}},\bm{SoC}_{t}^{\text{BESS}},\overline{\bm{p}_{t}}^{\text{Gen}}-\bm{p}_{t}^{\text{Gen}}\right) \tag{53}$$
Action

The action includes the restored load $\bm{p}_{\text{restored},t}^{\text{Load}}$ and the active power output of all kinds of generators $\bm{p}_{t}^{\text{Gen}}$ and BESSs $\bm{p}_{t}^{\text{BESS}}$.

$$\bm{a}^{\text{Restoration}}_{t}\triangleq\left(\bm{p}_{\text{restored},t}^{\text{Load}},\bm{p}_{t}^{\text{Gen}},\bm{p}_{t}^{\text{BESS}}\right) \tag{54}$$
Reward

The reward is to maximize the sum of restored loads $\sum\bm{p}_{\text{restored},t}^{\text{Load}}$.

$$R^{\text{Restoration}}(\bm{s},\bm{a})=\sum\bm{p}_{\text{restored},t}^{\text{Load}} \tag{55}$$
Constraint

System restoration requires adherence to fundamental power system operational constraints and equipment constraints, including AC-PF constraints (15), DC-PF constraints (19), BESS constraints (20), etc., all of which have been detailed above. In addition, a constraint must be added to ensure that the load is restored monotonically, i.e., that load restored at one step is not shed at the next:

$$\bm{p}_{\text{restored},t}^{\text{Load}}\leq\bm{p}_{\text{restored},t+1}^{\text{Load}} \tag{56}$$
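One simple way to respect (56) at execution time is to clip the action proposed by the policy, in the spirit of the clipping and safety-layer techniques listed in Table VII. The sketch below is an assumption-laden illustration rather than the method of [53] or [136]: the function names are hypothetical, and it assumes that the previously restored load never exceeds the demand at each load point.

```python
def enforce_monotonic_restoration(p_prev, p_proposed, p_demand):
    """Clip each proposed restored load into [p_prev, p_demand].

    The lower bound p_prev enforces the monotonicity constraint (56);
    the upper bound p_demand (an assumed extra limit) prevents restoring
    more load than is requested at that load point.
    """
    return [
        min(max(p, lo), hi)
        for p, lo, hi in zip(p_proposed, p_prev, p_demand)
    ]


def restoration_reward(p_restored):
    """Reward (55): the sum of restored loads at the current step."""
    return sum(p_restored)
```

For instance, with previously restored loads `[1.0, 2.0]`, a proposal `[0.5, 3.5]`, and demands `[2.0, 3.0]`, the first entry is raised back to 1.0 and the second is capped at 3.0. Clipping handles only (56); the power flow and BESS constraints still need their own treatment.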

IV-F3 Unit Commitment and Reserve Scheduling

Unit commitment and reserve scheduling are both conducted in the day-ahead market, taking into account future uncertainties, such as those from loads and RESs. [49, 137] utilize safe RL to generate strategies for unit commitment, as well as coordinated strategies for tie-line reserve and energy storage, respectively. The state, action, reward, and constraints of unit commitment and reserve scheduling are shown as follows:

State

The state consists of the historical and current net load forecasts $P^{\text{Load}}_{\text{his/pre}}$ and the commitment, start-up, and shut-down decisions at the previous stage:

$$\bm{s}^{\text{Reserve}}_{t}\triangleq\left(P^{\text{Load}}_{\text{his}},P^{\text{Load}}_{\text{pre}},\bm{u}_{\text{start},t-1},\bm{u}_{\text{shut},t-1},\bm{u}_{\text{com},t-1}\right) \tag{57}$$
Action

The action includes the current commitment, start-up, and shut-down decisions $\bm{u}_{\text{start/shut/com},t}$ and the power output of the generators $\bm{p}^{\text{Gen}}_{t}$:

$$\bm{a}^{\text{Reserve}}_{t}\triangleq\left(\bm{u}_{\text{start},t},\bm{u}_{\text{shut},t},\bm{u}_{\text{com},t},\bm{p}^{\text{Gen}}_{t}\right) \tag{58}$$
Reward

The reward is to minimize the overall costs, including the cost of power generation $R^{\text{Gen}}_{\text{cost}}$, commitment costs $R^{\text{Commitment}}_{\text{cost}}$, and start-up and shut-down costs $R^{\text{Start/Shut}}_{\text{cost}}$:

$$R^{\text{Reserve}}(\bm{s},\bm{a})=-\left(R^{\text{Gen}}_{\text{cost}}+R^{\text{Commitment}}_{\text{cost}}+R^{\text{Start}}_{\text{cost}}+R^{\text{Shut}}_{\text{cost}}\right) \tag{59}$$
Constraint
$$u_{\text{com},i,t}\underline{p}^{\text{Gen}}_{i}\leq p^{\text{Gen}}_{i,t}\leq u_{\text{com},i,t}\overline{p}^{\text{Gen}}_{i},~~\forall i\in\mathcal{G} \tag{60a}$$
$$\sum_{\xi=t-\underline{t}_{\text{up},i}+1}^{t}u_{\text{start},i,\xi}\leq u_{\text{com},i,t},~~\forall i\in\mathcal{G},\,t\in\{\underline{t}_{\text{up},i},\ldots,t_{\text{tot}}\} \tag{60b}$$
$$\sum_{\xi=t-\overline{t}_{\text{up},i}+1}^{t}u_{\text{shut},i,\xi}\leq 1-u_{\text{com},i,t},~~\forall i\in\mathcal{G},\,t\in\{\overline{t}_{\text{up},i},\ldots,t_{\text{tot}}\} \tag{60c}$$
$$u_{\text{com},i,t}-u_{\text{com},i,t-1}=u_{\text{start},i,t}-u_{\text{shut},i,t},~~\forall i\in\mathcal{G} \tag{60d}$$
$$u_{\text{start},i,t}+u_{\text{shut},i,t}\leq 1,~~\forall i\in\mathcal{G} \tag{60e}$$
$$\sum_{i\in\mathcal{G}}\bm{p}^{\text{Gen}}_{i,t}\leq P^{\text{Load}}_{\text{pre},t} \qquad \sum_{i\in\mathcal{G}}u_{\text{com},i,t}\overline{\bm{p}}^{\text{Gen}}_{e,i}\geq P_{\text{res},t} \tag{60f}$$
$$\bm{p}^{\text{Gen}}_{t}-\bm{p}^{\text{Gen}}_{t-1}\leq\bm{R}_{\text{up},t-1}\bm{u}_{\text{com},t-1}+\bm{S}_{\text{up},t}\bm{u}_{\text{start},t} \tag{60g}$$
$$\bm{p}^{\text{Gen}}_{t-1}-\bm{p}^{\text{Gen}}_{t}\leq\bm{R}_{\text{down},t}\bm{u}_{\text{com},t}+\bm{S}_{\text{down},t}\bm{u}_{\text{shut},t} \tag{60h}$$
$$u_{\text{start},i,t},u_{\text{shut},i,t},u_{\text{com},i,t}\in\{0,1\},~~\forall i\in\mathcal{G} \tag{60i}$$

where (60a) indicates generator limits; (60b) and (60c) represent minimum up-time and down-time constraints; (60d) and (60e) denote the logical relationship between the generator commitment decisions and start-up/shut-down decisions; (60f) indicates the power generation and reserve constraints; (60g) and (60h) represent ramp-up and ramp-down limits of generators; (60i) specifies the integrality requirement of commitment and start-up/shut-down decisions.
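The per-unit constraints above can be checked mechanically for a candidate schedule, for example inside an environment that reports constraint violations to a safe RL agent. The following sketch covers (60a) to (60e) and (60i) for a single generator; it omits the system-level and ramping constraints (60f) to (60h). The function name, the zero-based time indexing, and the `u_com_init` argument for the status before the first period are assumptions introduced for illustration. Here `min_up` plays the role of $\underline{t}_{\text{up},i}$ in (60b) and `min_down` the role of $\overline{t}_{\text{up},i}$ in (60c).

```python
def check_unit_schedule(u_com, u_start, u_shut, p, p_min, p_max,
                        min_up, min_down, u_com_init=0):
    """Return the violated (constraint, period) pairs for one generator."""
    violated = []
    for t in range(len(u_com)):
        # (60i): all decisions must be binary.
        if any(u not in (0, 1) for u in (u_com[t], u_start[t], u_shut[t])):
            violated.append(("60i", t))
        # (60a): output lies within limits when committed, is zero otherwise.
        if not (u_com[t] * p_min <= p[t] <= u_com[t] * p_max):
            violated.append(("60a", t))
        # (60d): a status change must match the start-up/shut-down decision.
        prev = u_com[t - 1] if t > 0 else u_com_init
        if u_com[t] - prev != u_start[t] - u_shut[t]:
            violated.append(("60d", t))
        # (60e): a unit cannot start up and shut down in the same period.
        if u_start[t] + u_shut[t] > 1:
            violated.append(("60e", t))
        # (60b): a start-up in the last min_up periods forces the unit on.
        if t >= min_up - 1 and sum(u_start[t - min_up + 1:t + 1]) > u_com[t]:
            violated.append(("60b", t))
        # (60c): a shut-down in the last min_down periods forces the unit off.
        if t >= min_down - 1 and sum(u_shut[t - min_down + 1:t + 1]) > 1 - u_com[t]:
            violated.append(("60c", t))
    return violated
```

An empty list means that the schedule satisfies the checked constraints. A unit that starts up and is shut down one period later under a three-period minimum up-time would be flagged under (60b).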

Regarding RRL, some researchers have begun to focus on game-theoretic RL for multistage games (also referred to as dynamic games) between attackers and defenders. This method, grounded in RL, aims to identify optimal attack sequences in pursuit of certain objectives [139, 140] and dynamic internal trading price strategies [141, 142]. Although chance-constrained RRL methods have garnered attention in automatic control [62, 61, 143], and several researchers have explored robust optimization and machine learning for power flow control [144, 145, 146], chance-constrained RRL for power system control and optimization remains underexplored.

V Challenges and Outlook

The application of safe RL in power systems is still in its infancy and faces several challenges, namely scalability, distributed settings, and industrial deployment. This section discusses these challenges and then outlines potential future research directions.

V-A Challenges in Safe RL

Although the general challenges of RL have already been reviewed in [1], this subsection will explore the unique challenges faced by existing safe RL approaches.

V-A1 Scalability

Real-world power systems encompass a vast number of buses and power lines. For instance, the Eastern Interconnection, a major North American power grid, has been modeled with over 60,000 buses in certain simulations. Large-scale multi-agent systems face scalability issues in such environments for two primary reasons. First, the state and action spaces expand dramatically with an increasing number of agents, a phenomenon known as the “curse of dimensionality,” which results in an exponentially growing search space for optimal actions. Second, as the number of buses grows, the number of power flow constraints and other hard physical constraints increases rapidly. Additionally, some research papers account for security constraints due to demand uncertainty in power systems, which further complicates the constraints in the RL training process. These factors make it challenging for safe RL to converge to feasible results using stochastic gradient descent methods. For the first issue, one notable method is the use of factored action spaces, which decomposes the action space into smaller, manageable components [147]. This approach has been applied successfully in complex environments like StarCraft and Dota 2, showing significant versatility and efficiency in handling combinatorial and continuous control problems. For the second issue, reduced-order polytopal constraints and low-order elliptical constraints have been employed to approximate complex constraints [148], which offers a potential route to incorporating extensive constraints effectively in safe RL.
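The benefit of factoring can be illustrated with a simple count. In the sketch below, which is not taken from [147], each of `n_components` controllable devices has `n_levels` discrete set-points. A joint policy must rank every combination, whereas a factored policy only scores each device's options separately. The function names are hypothetical, and the greedy selection assumes that the action value decomposes additively across components, which does not hold in general when network constraints couple the devices.

```python
def joint_action_count(n_components, n_levels):
    """Number of joint actions: grows exponentially with the number of components."""
    return n_levels ** n_components


def factored_output_count(n_components, n_levels):
    """Number of scores a factored policy produces: grows linearly."""
    return n_components * n_levels


def factored_greedy_action(scores):
    """Pick the best level for each component independently.

    scores[i][k] is the (assumed additive) score of level k for component i.
    Ties are resolved in favor of the lowest level index.
    """
    return [max(range(len(s)), key=s.__getitem__) for s in scores]
```

With ten devices and five set-points each, the joint space has 5^10 = 9,765,625 actions, while the factored representation needs only 50 scores.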

V-A2 Distributed Setting

Improvements in distributed systems and their algorithms have been essential to the rise of deep learning. Researchers have made a number of advances in creating multi-agent versions of learning algorithms and in developing distributed deep learning systems [149]. These methods have made it possible to scale up the training procedures for very large-scale systems. This motivates the adoption of distributed structures for DRL, which let agents converge quickly and explore and learn many different things at the same time.

A unique aspect of RL is the way agents actively shape the learning process by interacting with their environment and keeping a record of what they experience. DRL therefore uses distributed approaches to generate more learning data in a shorter time and to handle multiple learning processes simultaneously. Distributed DRL has been applied to complex power system tasks such as load scheduling [150] and the management of EVs [151]. What makes safe RL more challenging in a distributed setting is the need to decompose complex network constraints into smaller, manageable segments. The new challenge is ensuring that distributed safe DRL can reach a consensus on how to split these problems and converge on a solution while satisfying safety constraints throughout the learning process.

V-A3 Industrial Deployment

Current safe RL strategies largely rely on model-based approaches or training on historical data, which present significant challenges upon deployment in industrial settings. Among all the papers covered in this review, only [126] involves interacting with a real building to train a safe RL model for temperature control, which is possible because of its low deployment risk. In contrast, most studies related to the power grid have not led to technologies used in practical deployment because of safety concerns, and they often resort to model-based or data-based methods. The concern is that these methods, while effective in simulated environments or on historical data, may not fully capture the complex, dynamic, and uncertain nature of real-world industrial processes. The discrepancies between simulated environments and actual operational conditions can lead to unexpected behaviors or safety violations, as the learned policies may not generalize well to unseen situations. Furthermore, reliance on historical data limits the system’s ability to adapt to novel conditions or operational changes that were not represented in the training set. This necessitates the development of adaptive, robust, and transferable safe RL algorithms that can continuously learn and adjust to new data in real time, ensuring safety and efficiency in the face of the evolving operational dynamics characteristic of industrial environments.

V-B Future Directions in Safe RL

Regarding the challenges in applying safe RL to power systems, we present several potential future directions below.

V-B1 Exploring Offline Safe RL

DRL algorithms are based on an online learning paradigm, which presents a significant hurdle to their widespread adoption in power systems. In general, such online interaction is not practical, due to the expense (e.g., in robotics, educational agents, or healthcare) and risk (e.g., in autonomous driving, power systems, or healthcare) associated with exploring control actions in a safety-critical system [152]. Even in domains where online interaction is viable, leveraging previously collected data is often preferable, especially in complex domains where effective generalization necessitates extensive datasets.

Safe RL endeavors to achieve a policy that maximizes rewards within defined constraints, demonstrating advantages in meeting safety requirements for real-world applications. Nonetheless, many deep safe RL approaches primarily address safety post-training, neglecting the costs associated with constraint violations during the training phase. The necessity of collecting online interaction samples poses challenges in ensuring training safety, as preventing the agent from executing unsafe behaviors during learning is non-trivial [153]. Although carefully designed correction systems or human interventions can serve as safety mechanisms to filter unsafe actions during training, their application may prove costly due to the low sample efficiency of many RL approaches.
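As a minimal illustration of the kind of correction mechanism mentioned above, the sketch below projects a proposed action onto a single linear safety constraint g·a ≤ c, for which the closest safe action has a closed form. The function name, the linear form of the constraint, and the example numbers are assumptions made for illustration; they are not taken from any of the reviewed papers.

```python
from typing import List, Sequence


def project_to_halfspace(action: Sequence[float],
                         g: Sequence[float],
                         c: float) -> List[float]:
    """Return the closest action (Euclidean norm) satisfying g . a <= c.

    Hypothetical safety filter: `g` and `c` stand in for a linearized
    safety constraint (e.g., a voltage or line-flow limit).
    """
    g_norm_sq = sum(gi * gi for gi in g)
    if g_norm_sq == 0.0:
        raise ValueError("constraint normal g must be non-zero")
    violation = sum(gi * ai for gi, ai in zip(g, action)) - c
    if violation <= 0.0:
        return list(action)           # already safe: leave unchanged
    step = violation / g_norm_sq      # minimal correction along g
    return [ai - step * gi for ai, gi in zip(action, g)]


# The proposed action violates a_1 + a_2 <= 1, so it is pulled back
# to the constraint boundary.
safe_action = project_to_halfspace([1.0, 1.0], g=[1.0, 1.0], c=1.0)
```

Such a filter must be evaluated at every training step, which is one reason its cost grows quickly when the underlying RL algorithm needs many samples.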

It is important to add that it is reasonable to use a simulation environment as a digital twin for training. Even though discrepancies between simulations and real conditions are unavoidable, high-fidelity simulations and model-based numerical optimization are already the nuts and bolts of energy management systems and guide the control actions used to manage the grid today. If these models are accurate enough for today's decision systems to select control actions optimally, then it is reasonable to assume that they are sufficiently accurate to train optimal policies. This is an important question to address in research, since at the moment there is no comprehensive characterization of how the discrepancies between simulated and real environments affect performance and safety.

V-B2 Emphasizing Privacy in the Learning Process

As RL algorithms grow in popularity, so too do concerns about their privacy implications. The value or policy functions released are trained using reward signals and other inputs that often depend on sensitive data. In the domain of power systems, some rewards could inadvertently expose critical measurement data, such as voltage phasors and power demands, which in turn could lead to issues like false data injection. This historical data can potentially be deduced by recursively querying the released functions. One potential research direction is the development of differentially private algorithms for RL, which safeguard reward information from being compromised by techniques such as inverse RL [154]. The issue of privacy becomes even more critical in the offline RL setting, which is arguably more relevant for applications handling sensitive data. For example, in the EV charging domain, online RL necessitates the continual execution of new exploratory policies for each arriving EV, involving sensitive data like arrival and departure times. In contrast, offline RL relies on historical data of EV charging behavior, which can be particularly sensitive [155]. However, these differentially private mechanisms could introduce uncertainty into safety constraints. Concurrently, differentially private AC-PF constrained OPF has been explored, with studies formulating it as robust optimization to ensure the feasibility of these safety constraints [156]. One potential approach is to develop robust training formulations for safe DRL.
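To make the idea of protecting reward information concrete, the following sketch applies the standard Laplace mechanism to a single bounded reward before it is released. It is a generic textbook construction, not the algorithm of [154]; the function name, the clipping range, and the value of epsilon are assumptions chosen for illustration.

```python
import random


def privatize_reward(reward: float, low: float, high: float,
                     epsilon: float, rng: random.Random) -> float:
    """Release one reward value with Laplace noise.

    The reward is clipped to [low, high], so its sensitivity is at most
    (high - low); the noise scale is sensitivity / epsilon.
    """
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    if high <= low:
        raise ValueError("need low < high")
    clipped = min(max(reward, low), high)
    scale = (high - low) / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return clipped + noise


rng = random.Random(0)
noisy = [privatize_reward(0.3, 0.0, 1.0, epsilon=2.0, rng=rng)
         for _ in range(20000)]
mean_noisy = sum(noisy) / len(noisy)  # averages out near the true reward
```

Each individual release is noisy, which is exactly the source of the additional uncertainty in safety constraints noted above. Repeated releases also consume privacy budget, so a full treatment would need to account for composition across training steps.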

V-B3 Integrating Federated Learning Mechanism

To simultaneously address privacy and scalability issues, integrating federated learning into safe DRL could be a viable solution. In practical scenarios, RL faces challenges such as poor agent performance in large action and state spaces due to limited sample exploration and low sample efficiency impacting learning speed. Information exchange between agents can significantly boost learning rates. While distributed and parallel RL algorithms address these issues by centralizing data, parameters, or gradients for model training, this centralization can compromise privacy, leading to agent mistrust and data interception risks [157].

Federated learning (FL), however, enables information exchange without compromising privacy, helping agents adapt to diverse environments. It also addresses the simulation-reality gap often present in RL: while many RL algorithms depend on pre-training in simulated environments that do not perfectly mirror the real world, FL can combine insights from both to bridge this gap more accurately [158]. Additionally, FL is beneficial when agents only observe partial features, enabling effective aggregation of this limited information. These considerations give rise to the idea of federated safe RL, which merges FL and safe RL within a privacy-preserving framework, adapting safe RL strategies for sequential decision-making tasks.
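The aggregation step at the heart of this idea can be sketched as a sample-weighted average of locally trained policy parameters, in the style of federated averaging. The function name and the flat-list parameter representation are assumptions for illustration; the sketch omits secure aggregation and any handling of safety constraints.

```python
from typing import List, Sequence


def federated_average(local_params: Sequence[Sequence[float]],
                      num_samples: Sequence[int]) -> List[float]:
    """Sample-weighted average of local parameter vectors.

    Only parameters are shared with the aggregator; the raw local
    trajectories never leave the agents.
    """
    if not local_params or len(local_params) != len(num_samples):
        raise ValueError("need one sample count per agent")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(local_params[0])
    if any(len(p) != dim for p in local_params):
        raise ValueError("all parameter vectors must share one length")
    return [
        sum(n * p[i] for p, n in zip(local_params, num_samples)) / total
        for i in range(dim)
    ]


# Two agents; the second collected three times as many samples.
global_params = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Weighting by sample count is one common choice; a federated safe RL scheme might instead weight agents by constraint satisfaction, which remains an open design question.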

V-B4 Advancing Convex Insights

Convex optimization is extensively explored for its ability to provide analytical convergence and optimality guarantees, which in turn yield more stable policies. In the context of safe DRL with convex or non-convex constraints, integrating convex insights can enhance these convergence guarantees. One way to bring these insights into safe DRL is to explore the application of input convex neural networks (ICNNs). Rather than training a conventional policy that maps input data to control actions, which must adhere to stringent physical constraints, ICNNs offer a promising alternative due to their superior generalization capabilities. This approach bridges the gap between model accuracy and control tractability by constructing networks that are convex with respect to their inputs, as detailed in [159] and further applied in [160] to model complex physical systems accurately. Consequently, training an ICNN-based policy can more easily incorporate convex constraints to ensure feasible and safe optimal control actions with performance guarantees.
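The structural requirement behind an ICNN can be shown with a small forward pass: weights that act on hidden activations are kept non-negative and the activation is convex and non-decreasing (ReLU here), which makes the scalar output convex in the input. The class name, layer sizes, and random initialization below are assumptions for illustration, and the sketch covers only the forward pass, not training.

```python
import random
from typing import List, Sequence


def relu(v: Sequence[float]) -> List[float]:
    return [max(0.0, x) for x in v]


def matvec(W: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TinyICNN:
    """Two-hidden-layer network whose scalar output is convex in x."""

    def __init__(self, in_dim: int, hidden: int, rng: random.Random):
        u = lambda: rng.uniform(-1.0, 1.0)
        self.W0 = [[u() for _ in range(in_dim)] for _ in range(hidden)]
        self.b0 = [u() for _ in range(hidden)]
        # Weights on hidden activations must be non-negative for convexity.
        self.Wz = [[abs(u()) for _ in range(hidden)] for _ in range(hidden)]
        # Direct "passthrough" weights on the input may have any sign.
        self.Wx = [[u() for _ in range(in_dim)] for _ in range(hidden)]
        self.b1 = [u() for _ in range(hidden)]
        self.wz = [abs(u()) for _ in range(hidden)]
        self.wx = [u() for _ in range(in_dim)]

    def __call__(self, x: Sequence[float]) -> float:
        z1 = relu([a + b for a, b in zip(matvec(self.W0, x), self.b0)])
        pre2 = [a + b + c for a, b, c in
                zip(matvec(self.Wz, z1), matvec(self.Wx, x), self.b1)]
        z2 = relu(pre2)
        return (sum(w * z for w, z in zip(self.wz, z2))
                + sum(w * xi for w, xi in zip(self.wx, x)))


net = TinyICNN(in_dim=2, hidden=4, rng=random.Random(0))
value = net([0.5, -1.0])
```

Because the output is convex in the input, minimizing it over a convex feasible set of control actions is a convex problem, which is the property that makes such models attractive for constrained control.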

Additionally, using convex functions to approximate the policy function represents another viable strategy. Here, policy optimization can be formulated as a constrained optimization problem, where both the objective and constraints are initially nonconvex. By creating a series of surrogate convex-constrained optimization problems—substituting nonconvex functions locally with convex quadratic functions derived from policy gradient estimators, as described by [161]—this method allows for the practical application of theoretical insights into operational policies. These strategies underscore the potential of convex optimization techniques in enhancing the robustness and effectiveness of safe DRL algorithms, particularly in applications that demand adherence to strict safety and physical constraints.
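A generic form of such a successive convex approximation scheme can be written as follows. This is a sketch under stated assumptions, not necessarily the exact formulation of [161]: the problem is written as minimizing an expected cost J(θ) subject to D(θ) ≤ 0, and the symbols θ_k (current policy parameters), ĝ_k and ĥ_k (policy gradient estimates of J and D at θ_k), τ > 0 (curvature parameter), and η_k ∈ (0, 1] (step size) are introduced here for illustration.

```latex
\begin{aligned}
\tilde{J}_k(\theta) &= \hat{J}(\theta_k) + \hat{g}_k^{\top}(\theta-\theta_k)
  + \tau\,\lVert \theta-\theta_k \rVert_2^2,\\
\tilde{D}_k(\theta) &= \hat{D}(\theta_k) + \hat{h}_k^{\top}(\theta-\theta_k)
  + \tau\,\lVert \theta-\theta_k \rVert_2^2,\\
\bar{\theta}_k &= \arg\min_{\theta}\ \tilde{J}_k(\theta)
  \quad \text{s.t.}\quad \tilde{D}_k(\theta)\le 0,\\
\theta_{k+1} &= (1-\eta_k)\,\theta_k + \eta_k\,\bar{\theta}_k .
\end{aligned}
```

Each subproblem is a convex quadratically constrained quadratic program, so it can be solved reliably at every iteration even though the original objective and constraint are nonconvex.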

V-B5 Developing LLM-in-the-loop RL

Numerous practical objectives and constraints of power systems, such as those outlined in security guidelines and operation manuals, are based on linguistic stipulations and are difficult to model. In actual power system operations, when these constraints are violated, system operators typically need to take corrective actions [81]. Therefore, a human-in-the-loop approach has been proposed, where humans are integrated into the RL iteration process. This involvement allows humans to actively participate in constraint management, thereby enhancing the reliability of RL [162, 163]. Nonetheless, the human-in-the-loop approach is limited by the availability and time constraints of human experts, making it infeasible for tasks that require extensive amounts of training data or continuous adaptation.

With the advent of LLMs, the possibility of transitioning from human-in-the-loop to LLM-in-the-loop systems emerges as a viable alternative to address the aforementioned challenges. LLMs, with their powerful learning capabilities and vast knowledge based on power system data and linguistic stipulations, can provide consistent, real-time, and potentially unbiased feedback compared to human experts [164]. For example, [81] integrates the GPT LLM into the OPF framework with linguistic rules. This model quantifies natural language stipulations as objectives and constraints within the power system optimization problem for the first time. In the future, leveraging specialized knowledge in the power system domain to train dedicated LLMs will be crucial for extending their application across a broader spectrum of the power system industry. However, challenges remain in how LLMs can efficiently learn from power system knowledge bases, integrate with existing software tools, quantify uncertainties, and ensure the safety of constraints [164].

VI Conclusion

This paper represents the first comprehensive review of the application of safe RL in power systems, addressing pivotal operational tasks including optimal power generation dispatch, voltage control, stability control, EV charging control, electricity markets, service restoration, and unit commitment. In its first part, the paper introduces the foundational concepts of safe RL, including constraint classifications, existing algorithms, benchmarks, and the unique features and limitations of each algorithm. Subsequently, the paper provides a detailed overview of almost all existing studies on safe RL applications within power systems to date. It categorizes these studies according to their application domains, methodically enumerating each paper’s objectives, constraints, implemented safe RL techniques, environment types, and key features. This review establishes a foundation for the advancement of safe RL applications in power systems, providing direction for future research endeavors.

References

  • [1] X. Chen, G. Qu, Y. Tang, S. Low, and N. Li, “Reinforcement learning for selective key applications in power systems: Recent advances and future challenges,” IEEE Trans. Smart Grid, vol. 13, no. 4, pp. 2935–2958, Jul. 2022.
  • [2] L. Vu, T. Vu, T. L. Vu, and A. Srivastava, “Multi-agent deep reinforcement learning for distributed load restoration,” IEEE Trans. Smart Grid, 2023.
  • [3] J. Zhao, F. Li, S. Mukherjee, and C. Sticht, “Deep reinforcement learning-based model-free on-line dynamic multi-microgrid formation to enhance resilience,” IEEE Trans. Smart Grid, vol. 13, no. 4, pp. 2557–2567, Jul. 2022.
  • [4] S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, Y. Yang, and A. Knoll, “A review of safe reinforcement learning: Methods, theory and applications,” arXiv preprint arXiv:2205.10330, 2022.
  • [5] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” J. Mach. Learn. Res., vol. 16, no. 1, pp. 1437–1480, 2015.
  • [6] J. Li, X. Wang, S. Chen, and D. Yan, “Research and application of safe reinforcement learning in power system,” in Proc. Asia Conf. Power Electr. Eng., 2023, pp. 1977–1982.
  • [7] W. Zhao, T. He, R. Chen, T. Wei, and C. Liu, “State-wise safe reinforcement learning: A survey,” in Proc. Int. Joint Conf. Artif. Intell., 2023.
  • [8] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   Cambridge, MA, USA: MIT Press, 2018.
  • [9] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in Proc. Int. Conf. Mach. Learn., vol. 70, no. 10, Aug. 2017, pp. 22–31.
  • [10] H. Li and H. He, “Learning to operate distribution networks with safe deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 13, no. 3, pp. 1860–1872, May 2022.
  • [11] Q. Zhang, K. Dehghanpour, Z. Wang, F. Qiu, and D. Zhao, “Multi-agent safe policy learning for power management of networked microgrids,” IEEE Trans. Smart Grid, vol. 12, no. 2, pp. 1048–1062, Mar. 2020.
  • [12] Y. Ye, H. Wang, P. Chen, Y. Tang, and G. Strbac, “Safe deep reinforcement learning for microgrid energy management in distribution networks with leveraged spatial-temporal perception,” IEEE Trans. Smart Grid, vol. 14, no. 5, pp. 3759–3775, Sep. 2023.
  • [13] Y. Liu, A. Halev, and X. Liu, “Policy learning with constraints in model-free reinforcement learning: A survey,” in Proc. Int. Joint Conf. Artif. Intell., Aug. 2021, pp. 1–8.
  • [14] Z. Yi, Y. Xu, and C. Wu, “Model-free economic dispatch for virtual power plants: An adversarial safe reinforcement learning approach,” IEEE Trans. Power Syst., 2023.
  • [15] W. Liu, P. Zhuang, H. Liang, J. Peng, and Z. Huang, “Distributed economic dispatch in microgrids based on cooperative reinforcement learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2192–2203, Jun. 2018.
  • [16] H. Liu and W. Wu, “Two-stage deep reinforcement learning for inverter-based volt-var control in active distribution networks,” IEEE Trans. Smart Grid, vol. 12, no. 3, pp. 2037–2047, 2020.
  • [17] R. Yan, Q. Xing, and Y. Xu, “Multi-agent safe graph reinforcement learning for pv inverters-based real-time decentralized volt/var control in zoned distribution networks,” IEEE Trans. Smart Grid, Jan. 2023.
  • [18] E. Altman, Constrained Markov decision processes.   London, U.K.: Chapman and Hall, Mar. 1999.
  • [19] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” J. Mach. Learn. Res., vol. 18, no. 167, pp. 1–51, 2018.
  • [20] D. Bertsekas, Convex optimization algorithms.   Athena Scientific, 2015.
  • [21] A. Ray, J. Achiam, and D. Amodei, “Benchmarking safe exploration in deep reinforcement learning,” arXiv preprint arXiv:1910.01708, 2019.
  • [22] S. Gu, J. G. Kuba, M. Wen, R. Chen, Z. Wang, Z. Tian, J. Wang, A. Knoll, and Y. Yang, “Multi-agent constrained policy optimisation,” arXiv preprint arXiv:2110.02793, 2021.
  • [23] A. Stooke, J. Achiam, and P. Abbeel, “Responsive safety in reinforcement learning by PID Lagrangian methods,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 9133–9143.
  • [24] T. Wu, A. Scaglione, and D. Arnold, “Constrained reinforcement learning for predictive control in real-time stochastic dynamic optimal power flow,” IEEE Trans. Power Syst., 2023.
  • [25] W. Wang, N. Yu, Y. Gao, and J. Shi, “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Trans. Smart Grid, vol. 11, no. 4, pp. 3008–3018, Jul. 2019.
  • [26] T. Wu, A. Scaglione, A. P. Surani, D. Arnold, and S. Peisert, “Network-constrained reinforcement learning for optimal ev charging control,” in Proc. IEEE Int. Conf. Smart Grid Commun., 2023, pp. 1–6.
  • [27] Z. Yan and Y. Xu, “A hybrid data-driven method for fast solution of security-constrained optimal power flow,” IEEE Trans. Power Syst., vol. 37, no. 6, pp. 4365–4374, Nov. 2022.
  • [28] A. R. Sayed, X. Zhang, Y. Wang, G. Wang, J. Qiu, and C. Wang, “Online operational decision-making for integrated electric-gas systems with safe reinforcement learning,” IEEE Trans. Power Syst., 2023.
  • [29] D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained markov decision processes,” Proc. Adv. Neural Inf. Process. Syst., vol. 33, pp. 8378–8390, 2020.
  • [30] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge, “Projection-based constrained policy optimization,” in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–24.
  • [31] Y. Zhang, Q. Vuong, and K. Ross, “First order constrained optimization in policy space,” in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 15 338–15 349.
  • [32] L. Yang, J. Ji, J. Dai, Y. Zhang, P. Li, and G. Pan, “Cup: A conservative update policy algorithm for safe reinforcement learning,” arXiv preprint arXiv:2202.07565, 2022.
  • [33] M. Zhang, G. Guo, S. Magnússon, R. C. Pilawa-Podgurski, and Q. Xu, “Data driven decentralized control of inverter based renewable energy sources using safe guaranteed multi-agent deep reinforcement learning,” IEEE Trans. Sustain. Energy, 2023.
  • [34] Y. Jiang, Q. Ye, B. Sun, Y. Wu, and D. H. Tsang, “Data-driven coordinated charging for electric vehicles with continuous charging rates: A deep policy gradient approach,” IEEE Internet Things J., vol. 9, no. 14, pp. 12 395–12 412, Jul. 2021.
  • [35] W. Wang, N. Yu, J. Shi, and Y. Gao, “Volt-VAR control in power distribution systems with deep reinforcement learning,” in Proc. IEEE Int. Conf. Commun. Control Comput. Technol. Smart Grids, Oct. 2019, pp. 1–7.
  • [36] X. Wang, R. Wang, and Y. Cheng, “Safe reinforcement learning: A survey,” Acta Automatica Sinica, vol. 49, pp. 1–23, 2023.
  • [37] R. Sepulchre, M. Jankovic, and P. V. Kokotovic, Constructive nonlinear control.   Springer Science & Business Media, 2012.
  • [38] T. J. Perkins and A. G. Barto, “Lyapunov design for safe reinforcement learning,” J. Mach. Learn. Res., vol. 3, pp. 803–832, Dec 2002.
  • [39] W. Cui, Y. Jiang, and B. Zhang, “Reinforcement learning for optimal primary frequency control: A lyapunov approach,” IEEE Trans. Power Syst., vol. 38, no. 2, pp. 1676–1688, 2022.
  • [40] W. Cui, J. Li, and B. Zhang, “Decentralized safe reinforcement learning for inverter-based voltage control,” Electric Power Syst. Res., vol. 211, 2022, Art. no. 108609.
  • [41] Y. Shi, G. Qu, S. Low, A. Anandkumar, and A. Wierman, “Stability constrained reinforcement learning for real-time voltage control,” in Proc. Amer. Control Conf., 2022, pp. 2715–2721.
  • [42] C. K. Williams and C. E. Rasmussen, Gaussian processes for machine learning.   Cambridge, MA, USA: MIT Press, 2006, vol. 2, no. 3.
  • [43] A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin, “Reachability-based safe learning with gaussian processes,” in Proc. IEEE Conf. Decis. Control, Dec. 2014, pp. 1424–1431.
  • [44] Y. Sui, A. Gotovos, J. Burdick, and A. Krause, “Safe exploration for optimization with gaussian processes,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 997–1005.
  • [45] A. I. Cowen-Rivers, D. Palenicek, V. Moens, M. A. Abdullah, A. Sootla, J. Wang, and H. Bou-Ammar, “Samba: Safe model-based & active reinforcement learning,” Mach. Learn., vol. 111, no. 1, pp. 173–203, 2022.
  • [46] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, “Safe reinforcement learning via shielding,” in Proc. AAAI Conf. Artif. Intell., vol. 32, no. 1, Apr. 2018, pp. 2661–2669.
  • [47] P. Chen, S. Liu, X. Wang, and I. Kamwa, “Physics-shielded multi-agent deep reinforcement learning for safe active voltage control with photovoltaic/battery energy storage systems,” IEEE Trans. Smart Grid, Jul. 2022.
  • [48] Q. Zhang, M. H. B. Mahbod, C.-B. Chng, P.-S. Lee, and C.-K. Chui, “Residual physics and post-posed shielding for safe deep reinforcement learning method,” IEEE Trans. Cybern., 2022.
  • [49] A. Ajagekar and F. You, “Deep reinforcement learning based unit commitment scheduling under load and wind power uncertainty,” IEEE Trans. Sustain. Energy, vol. 14, no. 2, pp. 803–812, Apr. 2023.
  • [50] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” arXiv preprint arXiv:1801.08757, 2018.
  • [51] Z. Yi, X. Wang, C. Yang, C. Yang, M. Niu, and W. Yin, “Real-time sequential security-constrained optimal power flow: A hybrid knowledge-data-driven reinforcement learning approach,” IEEE Trans. Power Syst., vol. 39, no. 1, pp. 1664–1680, Jan. 2024.
  • [52] Y. Gao and N. Yu, “Model-augmented safe reinforcement learning for volt-var control in power distribution networks,” Appl. Energy, vol. 313, 2022, Art. no. 118762.
  • [53] Y. Du and D. Wu, “Deep reinforcement learning from demonstrations to assist service restoration in islanded microgrids,” IEEE Trans. Sustain. Energy, vol. 13, no. 2, pp. 1062–1072, Apr. 2022.
  • [54] Y. Wang, S. S. Zhan, R. Jiao, Z. Wang, W. Jin, Z. Yang, Z. Wang, C. Huang, and Q. Zhu, “Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 36 593–36 604.
  • [55] Y. Liu, J. Ding, and X. Liu, “IPO: Interior-point policy optimization under constraints,” in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 04, 2020, pp. 4940–4947.
  • [56] H. Cui, Y. Ye, J. Hu, Y. Tang, Z. Lin, and G. Strbac, “Online preventive control for transmission overload relief using safe reinforcement learning with enhanced spatial-temporal awareness,” IEEE Trans. Power Syst., vol. 39, no. 1, pp. 517–532, Jan. 2024.
  • [57] T. L. Vu, S. Mukherjee, R. Huang, and Q. Huang, “Barrier function-based safe reinforcement learning for emergency control of power systems,” in Proc. IEEE Conf. Decis. Control, 2021, pp. 3652–3657.
  • [58] M. Zanon and S. Gros, “Safe reinforcement learning using robust mpc,” IEEE Trans. Autom. Control, vol. 66, no. 8, pp. 3638–3652, 2020.
  • [59] Y. Li, N. Li, H. E. Tseng, A. Girard, D. Filev, and I. Kolmanovsky, “Safe reinforcement learning using robust action governor,” in Proc. Learn. Dyn. Control, 2021, pp. 1093–1104.
  • [60] A. B. Kordabad, R. Wisniewski, and S. Gros, “Safe reinforcement learning using wasserstein distributionally robust mpc and chance constraint,” IEEE Access, vol. 10, pp. 130 058–130 067, 2022.
  • [61] S. Pfrommer, T. Gautam, A. Zhou, and S. Sojoudi, “Safe reinforcement learning with chance-constrained model predictive control,” in Proc. Learn. Dyn. Control, 2022, pp. 291–303.
  • [62] J. Coulson, J. Lygeros, and F. Dörfler, “Distributionally robust chance constrained data-enabled predictive control,” IEEE Trans. Autom. Control, vol. 67, no. 7, pp. 3289–3304, 2021.
  • [63] J. Yu, C. Gehring, F. Schäfer, and A. Anandkumar, “Robust reinforcement learning: A constrained game-theoretic approach,” in Proc. Learn. Dyn. Control, 2021, pp. 1242–1254.
  • [64] A. Rajeswaran, I. Mordatch, and V. Kumar, “A game theoretic framework for model based reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 7953–7963.
  • [65] A. Asheralieva and D. Niyato, “Hierarchical game-theoretic and reinforcement learning framework for computational offloading in uav-enabled mobile edge computing networks with multiple service providers,” IEEE Internet Things J., vol. 6, no. 5, pp. 8753–8769, 2019.
  • [66] C. Tessler, Y. Efroni, and S. Mannor, “Action robust reinforcement learning and applications in continuous control,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 6215–6224.
  • [67] A. Ray, J. Achiam, and D. Amodei, “Safety-gym: Tools for accelerating safe exploration research,” [Online]. Available: https://github.com/openai/safety-gym, accessed: Feb. 04, 2024.
  • [68] ——, “Safety-starter-agents: Basic constrained RL agents,” [Online]. Available: https://github.com/openai/safety-starter-agents, accessed: Feb. 04, 2024.
  • [69] J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y. Geng, Y. Zhong, J. Dai, and Y. Yang, “Safety-gymnasium: A unified safe reinforcement learning benchmark,” arXiv preprint arXiv:2310.12567, 2023.
  • [70] ——, “Safety-gymnasium: A unified safe reinforcement learning benchmark,” [Online]. Available: https://github.com/PKU-Alignment/safety-gymnasium, accessed: Feb. 04, 2024.
  • [71] ——, “Safe policy optimization: A benchmark repository for safe reinforcement learning algorithms,” [Online]. Available: https://github.com/PKU-Alignment/Safe-Policy-Optimization, accessed: Feb. 04, 2024.
  • [72] J. Ji, J. Zhou, B. Zhang, J. Dai, X. Pan, R. Sun, W. Huang, Y. Geng, M. Liu, and Y. Yang, “Omnisafe: An infrastructure for accelerating safe reinforcement learning research,” arXiv preprint arXiv:2305.09304, 2023.
  • [73] ——, “Omnisafe: An infrastructural framework for accelerating safe rl research,” [Online]. Available: https://github.com/PKU-Alignment/omnisafe, accessed: Feb. 04, 2024.
  • [74] M. Eichelbeck, H. Markgraf, and M. Althoff, “Contingency-constrained economic dispatch with safe reinforcement learning,” in Proc. IEEE Int. Conf. Mach. Learn. Appl., 2022, pp. 597–602.
  • [75] Y. Ding, X. Chen, and J. Wang, “Deep reinforcement learning-based method for joint optimization of mobile energy storage systems and power grid with high renewable energy sources,” Batteries, vol. 9, no. 4, p. 219, 2023.
  • [76] A. R. Sayed, C. Wang, H. Anis, and T. Bi, “Feasibility constrained online calculation for real-time optimal power flow: A convex constrained deep reinforcement learning approach,” IEEE Trans. Power Syst., vol. 38, no. 6, pp. 5215–5227, Nov. 2023.
  • [77] Y. Chen, Q. Du, H. Liu, L. Cheng, and M. S. Younis, “Improved proximal policy optimization algorithm for sequential security-constrained optimal power flow based on expert knowledge and safety layer,” J. Modern Power Syst. Clean Energy, 2023.
  • [78] J. Zhang, L. Sang, Y. Xu, and H. Sun, “Networked multiagent-based safe reinforcement learning for low-carbon demand management in distribution networks,” IEEE Trans. Sustain. Energy, 2024.
  • [79] H. Li, Z. Wang, L. Li, and H. He, “Online microgrid energy management based on safe deep reinforcement learning,” in IEEE Symp. Ser. Comput. Intell., 2021, pp. 1–8.
  • [80] H. Shengren, P. P. Vergara, E. M. S. Duque, and P. Palensky, “Optimal energy system scheduling using a constraint-aware reinforcement learning algorithm,” Int. J. Electr. Power Energy Syst., vol. 152, 2023, Art. no. 109230.
  • [81] Z. Yan and Y. Xu, “Real-time optimal power flow with linguistic stipulations: Integrating gpt-agent and deep reinforcement learning,” IEEE Trans. Power Syst., 2023.
  • [82] ——, “Real-time optimal power flow: A lagrangian based deep reinforcement learning approach,” IEEE Trans. Power Syst., vol. 35, no. 4, pp. 3270–3273, Jul. 2020.
  • [83] S.-H. Hong and H.-S. Lee, “Robust energy management system with safe reinforcement learning using short-horizon forecasts,” IEEE Trans. Smart Grid, vol. 14, no. 3, pp. 2485–2488, May 2023.
  • [84] G. Ceusters, L. R. Camargo, R. Franke, A. Nowé, and M. Messagie, “Safe reinforcement learning for multi-energy management systems with known constraint functions,” Energy AI, vol. 12, 2023, Art. no. 100227.
  • [85] Y. Wang, D. Qiu, M. Sun, G. Strbac, and Z. Gao, “Secure energy management of multi-energy microgrid: A physical-informed safe reinforcement learning approach,” Appl. Energy, vol. 335, Apr. 2023, Art. no. 120759.
  • [86] B. Kocuk, S. S. Dey, and X. A. Sun, “Strong socp relaxations for the optimal power flow problem,” Oper. Res., vol. 64, no. 6, pp. 1177–1196, 2016.
  • [87] A. Marano-Marcolini, F. Capitanescu, J. L. Martinez-Ramos, and L. Wehenkel, “Exploiting the use of dc scopf approximation to improve iterative ac scopf algorithms,” IEEE Trans. Power Syst., vol. 27, no. 3, pp. 1459–1466, Aug. 2012.
  • [88] M. Yan, M. Shahidehpour, A. Paaso, L. Zhang, A. Alabdulwahab, and A. Abusorrah, “A convex three-stage scopf approach to power system flexibility with unified power flow controllers,” IEEE Trans. Power Syst., vol. 36, no. 3, pp. 1947–1960, May 2021.
  • [89] T. Su, J. Zhao, X. Chen, and X. Liu, “Analytic input convex neural networks-based model predictive control for power system transient stability enhancement,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2023, pp. 1–5.
  • [90] T. Wu, A. Scaglione, and D. Arnold, “Constrained reinforcement learning for stochastic dynamic optimal power flow control,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2023, pp. 1–5.
  • [91] M. Zhang, G. Guo, T. Zhao, and Q. Xu, “DNN assisted projection based deep reinforcement learning for safe control of distribution grids,” IEEE Trans. Power Syst., 2023.
  • [92] H. Liu and W. Wu, “Online multi-agent reinforcement learning for decentralized inverter-based volt-var control,” IEEE Trans. Smart Grid, vol. 12, no. 4, pp. 2980–2990, Jul. 2021.
  • [93] P. Kou, D. Liang, C. Wang, Z. Wu, and L. Gao, “Safe deep reinforcement learning-based constrained optimal control scheme for active distribution networks,” Appl. Energy, vol. 264, 2020, Art. no. 114772.
  • [94] G. Guo, M. Zhang, Y. Gong, and Q. Xu, “Safe multi-agent deep reinforcement learning for real-time decentralized control of inverter based renewable energy resources considering communication delay,” Appl. Energy, vol. 349, 2023, Art. no. 121648.
  • [95] Y. Chen, Y. Shi, D. Arnold, and S. Peisert, “Saver: Safe learning-based controller for real-time voltage regulation,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2022, pp. 1–5.
  • [96] H. T. Nguyen and D.-H. Choi, “Three-stage inverter-based peak shaving and volt-var control in active distribution networks using online safe deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 13, no. 4, pp. 3266–3277, Jul. 2022.
  • [97] I. L. Carreño, A. Scaglione, D. Arnold, and T. Wu, “Voltage security region of a three-phase unbalanced distribution power system with dynamics,” IEEE Trans. Power Syst., 2024.
  • [98] T. Wu, A. Scaglione, and D. Arnold, “Reinforcement learning using physics inspired graph convolutional neural networks,” in 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2022, pp. 1–8.
  • [99] C. Roberts, S.-T. Ngo, A. Milesi, A. Scaglione, S. Peisert, and D. Arnold, “Deep reinforcement learning for mitigating cyber-physical der voltage unbalance attacks,” in 2021 American Control Conference (ACC).   IEEE, 2021, pp. 2861–2867.
  • [100] A. F. Bastos, S. Santoso, V. Krishnan, and Y. Zhang, “Machine learning-based prediction of distribution network voltage and sensor allocation,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2020, pp. 1–5.
  • [101] Q. Hou, E. Du, N. Zhang, and C. Kang, “Impact of high renewable penetration on the power system operation mode: A data-driven approach,” IEEE Trans. Power Syst., vol. 35, no. 1, pp. 731–741, Jan. 2020.
  • [102] H. Ruan, H. Gao, Y. Liu, L. Wang, and J. Liu, “Distributed voltage control in active distribution network considering renewable energy: A novel network partitioning method,” IEEE Trans. Power Syst., vol. 35, no. 6, pp. 4220–4231, Nov. 2020.
  • [103] H. Zhang, X. Sun, M. H. Lee, and J. Moon, “Deep reinforcement learning based active network management and emergency load-shedding control for power systems,” IEEE Trans. Smart Grid, 2023.
  • [104] J. Feng, W. Cui, J. Cortés, and Y. Shi, “Bridging transient and steady-state performance in voltage control: A reinforcement learning approach with safe gradient flow,” IEEE Control Syst. Lett., 2023.
  • [105] Y. Xia, Y. Xu, Y. Wang, S. Mondal, S. Dasgupta, A. K. Gupta, and G. M. Gupta, “A safe policy learning-based method for decentralized and economic frequency control in isolated networked-microgrid systems,” IEEE Trans. Sustain. Energy, vol. 13, no. 4, pp. 1982–1993, Oct. 2022.
  • [106] X. Wan, M. Sun, B. Chen, Z. Chu, and F. Teng, “Adapsafe: Adaptive and safe-certified deep reinforcement learning-based frequency control for carbon-neutral power systems,” in Proc. AAAI Conf. Artif. Intell., 2023.
  • [107] D. Tabas and B. Zhang, “Computationally efficient safe reinforcement learning for power systems,” in Proc. Amer. Control Conf., 2022, pp. 3303–3310.
  • [108] Y. Zhou, L. Zhou, D. Shi, and X. Zhao, “Coordinated frequency control through safe reinforcement learning,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2022, pp. 1–5.
  • [109] P. Gupta, A. Pal, and V. Vittal, “Coordinated wide-area damping control using deep neural networks and reinforcement learning,” IEEE Trans. Power Syst., vol. 37, no. 1, pp. 365–376, Jan. 2022.
  • [110] K.-b. Kwon, S. Mukherjee, T. L. Vu, and H. Zhu, “Risk-constrained reinforcement learning for inverter-dominated power system controls,” IEEE Control Syst. Lett., vol. 7, pp. 3854–3859, 2023.
  • [111] M. Tarle, M. Larsson, G. Ingeström, L. Nordström, and M. Björkman, “Safe reinforcement learning for mitigation of model errors in facts setpoint control,” in Proc. Int. Conf. Smart Energy Syst. Technol., 2023, pp. 1–6.
  • [112] L. Wehenkel and M. Pavella, “Preventive vs. emergency control of power systems,” in Proc. IEEE PES Power Syst. Conf. Expo., 2004, pp. 1665–1670.
  • [113] P. Kundur et al., “Definition and classification of power system stability IEEE/CIGRE joint task force on stability terms and definitions,” IEEE Trans. Power Syst., vol. 19, no. 3, pp. 1387–1401, Aug. 2004.
  • [114] N. Hatziargyriou, J. Milanovic, C. Rahmann, V. Ajjarapu, C. Canizares, I. Erlich, D. Hill, I. Hiskens, I. Kamwa, B. Pal et al., “Definition and classification of power system stability–revisited & extended,” IEEE Trans. Power Syst., vol. 36, no. 4, pp. 3271–3281, Jul. 2021.
  • [115] G. Chen and X. Shi, “A deep reinforcement learning-based charging scheduling approach with augmented Lagrangian for electric vehicle,” arXiv preprint arXiv:2209.09772, 2022.
  • [116] H. Zhang, J. Peng, H. Tan, H. Dong, and F. Ding, “A deep reinforcement learning-based energy management framework with Lagrangian relaxation for plug-in hybrid electric vehicle,” IEEE Trans. Transport. Electrific., vol. 7, no. 3, pp. 1146–1160, Sep. 2020.
  • [117] S. Zhang, R. Jia, H. Pan, and Y. Cao, “A safe reinforcement learning-based charging strategy for electric vehicles in residential microgrid,” Appl. Energy, vol. 348, Oct. 2023, Art. no. 121490.
  • [118] H. Li, Z. Wan, and H. He, “Constrained EV charging scheduling based on safe deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 11, no. 3, pp. 2427–2439, May 2020.
  • [119] R. Liessner, A. M. Dietermann, and B. Bäker, “Safe deep reinforcement learning hybrid electric vehicle energy management,” in Proc. Int. Conf. Agents Artif. Intell., 2019, pp. 161–181.
  • [120] International Energy Agency, “Global EV outlook 2023: Catching up with climate ambitions,” 2023. [Online]. Available: https://www.iea.org/reports/global-ev-outlook-2023
  • [121] F. Rassaei, W.-S. Soh, and K.-C. Chua, “Demand response for residential electric vehicles with random usage patterns in smart grids,” IEEE Trans. Sustain. Energy, vol. 6, no. 4, pp. 1367–1376, Oct. 2015.
  • [122] D. V. Le, R. Wang, Y. Liu, R. Tan, Y.-W. Wong, and Y. Wen, “Deep reinforcement learning for tropical air free-cooled data center control,” ACM Trans. Sensor Netw., vol. 17, no. 3, pp. 1–28, 2021.
  • [123] Q. Zhang, C.-B. Chng, K. Chen, P.-S. Lee, and C.-K. Chui, “DRL-S: Toward safe real-world learning of dynamic thermal management in data center,” Expert Syst. Appl., vol. 214, 2023, Art. no. 119146.
  • [124] H. Ding, Y. Xu, B. C. S. Hao, Q. Li, and A. Lentzakis, “A safe reinforcement learning approach for multi-energy management of smart home,” Electric Power Syst. Res., vol. 210, 2022, Art. no. 108120.
  • [125] P. Yu, H. Zhang, Y. Song, H. Hui, and G. Chen, “District cooling system control for providing operating reserve based on safe deep reinforcement learning,” IEEE Trans. Power Syst., vol. 39, no. 1, pp. 40–52, 2024.
  • [126] X. Lin, D. Yuan, and X. Li, “Reinforcement learning with dual safety policies for energy savings in building energy systems,” Buildings, vol. 13, no. 3, p. 580, 2023.
  • [127] C. Zhang, S. R. Kuppannagari, and V. K. Prasanna, “Safe building HVAC control via batch reinforcement learning,” IEEE Trans. Sustain. Comput., vol. 7, no. 4, pp. 923–934, 2022.
  • [128] Z. Liang, C. Huang, W. Su, N. Duan, V. Donde, B. Wang, and X. Zhao, “Safe reinforcement learning-based resilient proactive scheduling for a commercial building considering correlated demand response,” IEEE Open Access J. Power Energy, vol. 8, pp. 85–96, 2021.
  • [129] D. Qiu, Z. Dong, X. Zhang, Y. Wang, and G. Strbac, “Safe reinforcement learning for real-time automatic control in a smart energy-hub,” Appl. Energy, vol. 309, 2022, Art. no. 118403.
  • [130] A. D. Garmroodi, F. Nasiri, and F. Haghighat, “Optimal dispatch of an energy hub with compressed air energy storage: A safe reinforcement learning approach,” J. Energy Storage, vol. 57, 2023, Art. no. 106147.
  • [131] I. Hamilton, H. Kennard, J. Amorocho, S. Steuwer, J. Kockat, Z. Toth, C. Delmastro, R. M. Gordon, and K. Petrichenko, “Global status report for buildings and construction,” UN Environment Programme, Tech. Rep., 2024.
  • [132] H.-Y. Liu, B. Balaji, S. Gao, R. Gupta, and D. Hong, “Safe HVAC control via batch reinforcement learning,” in Proc. ACM/IEEE Int. Conf. Cyber-Phys. Syst., 2022, pp. 181–192.
  • [133] X. Shi, Y. Xu, G. Chen, and Y. Guo, “An augmented Lagrangian-based safe reinforcement learning algorithm for carbon-oriented optimal scheduling of EV aggregators,” IEEE Trans. Smart Grid, vol. 15, no. 1, pp. 795–809, Jan. 2024.
  • [134] T. Zhu, X. Zhang, J. Duan, Z. Zhou, and X. Chen, “A budget-aware incentive mechanism for vehicle-to-grid via reinforcement learning,” in Proc. IEEE Int. Symp. Qual. Service, 2023, pp. 1–10.
  • [135] H. Yang, Y. Xu, and Q. Guo, “Dynamic incentive pricing on charging stations for real-time congestion management in distribution network: An adaptive model-based safe deep reinforcement learning method,” IEEE Trans. Sustain. Energy, 2023.
  • [136] X. Zhang, B. Knueven, A. Zamzam, M. Reynolds, and W. Jones, “Primal-dual differentiable programming for distribution system critical load restoration,” in Proc. IEEE Power Energy Soc. Gen. Meeting, 2023, pp. 1–5.
  • [137] X. Li, X. Han, and M. Yang, “Risk-based reserve scheduling for active distribution networks based on an improved proximal policy optimization algorithm,” IEEE Access, vol. 11, pp. 15 211–15 228, 2022.
  • [138] Z. Zhang and M. Wu, “Predicting real-time locational marginal prices: A GAN-based approach,” IEEE Trans. Power Syst., vol. 37, no. 2, pp. 1286–1296, 2021.
  • [139] Z. Ni and S. Paul, “A multistage game in smart grid security: A reinforcement learning solution,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2684–2695, 2019.
  • [140] Y. Guo, L. Wang, Z. Liu, and Y. Shen, “Reinforcement-learning-based dynamic defense strategy of multistage game against dynamic load altering attack,” Int. J. Electr. Power Energy Syst., vol. 131, 2021, Art. no. 107113.
  • [141] V.-H. Bui, A. Hussain, and W. Su, “A dynamic internal trading price strategy for networked microgrids: a deep reinforcement learning-based game-theoretic approach,” IEEE Trans. Smart Grid, vol. 13, no. 5, pp. 3408–3421, 2022.
  • [142] A.-P. Surani, T. Wu, and A. Scaglione, “Competitive reinforcement learning for real-time pricing and scheduling control in coupled ev charging stations and power networks,” in Proc. Int. Conf. Syst. Sci., 2024.
  • [143] B. Peng, J. Duan, J. Chen, S. E. Li, G. Xie, C. Zhang, Y. Guan, Y. Mu, and E. Sun, “Model-based chance-constrained reinforcement learning via separated proportional-integral Lagrangian,” IEEE Trans. Neural Netw. Learn. Syst., 2022.
  • [144] A. Hassan, R. Mieth, M. Chertkov, D. Deka, and Y. Dvorkin, “Optimal load ensemble control in chance-constrained optimal power flow,” IEEE Trans. Smart Grid, vol. 10, no. 5, pp. 5186–5195, 2018.
  • [145] O. Ciftci, M. Mehrtash, and A. Kargarian, “Data-driven nonparametric chance-constrained optimization for microgrid energy management,” IEEE Trans. Ind. Inform., vol. 16, no. 4, pp. 2447–2457, 2019.
  • [146] J. Liang, W. Jiang, C. Lu, and C. Wu, “Joint chance-constrained unit commitment: Statistically feasible robust optimization with learning-to-optimize acceleration,” IEEE Trans. Power Syst., 2024.
  • [147] S. Tang, M. Makar, M. Sjoding, F. Doshi-Velez, and J. Wiens, “Leveraging factored action spaces for efficient offline reinforcement learning in healthcare,” in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022, pp. 34 272–34 286.
  • [148] K. Hreinsson, A. Scaglione, M. Alizadeh, and Y. Chen, “New insights from the Shapley-Folkman lemma on dispatchable demand in energy markets,” IEEE Trans. Power Syst., vol. 36, no. 5, pp. 4028–4041, 2021.
  • [149] J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” ACM Comput. Surv., vol. 53, no. 2, pp. 1–33, 2020.
  • [150] H.-M. Chung, S. Maharjan, Y. Zhang, and F. Eliassen, “Distributed deep reinforcement learning for intelligent load scheduling in residential smart grids,” IEEE Trans. Ind. Inform., vol. 17, no. 4, pp. 2752–2763, 2020.
  • [151] X. Tang, J. Chen, T. Liu, Y. Qin, and D. Cao, “Distributed deep reinforcement learning-based energy and emission management strategy for hybrid electric vehicles,” IEEE Trans. Veh. Technol., vol. 70, no. 10, pp. 9922–9934, 2021.
  • [152] S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv:2005.01643, 2020.
  • [153] Z. Liu, Z. Guo, Y. Yao, Z. Cen, W. Yu, T. Zhang, and D. Zhao, “Constrained decision transformer for offline safe reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 21 611–21 630.
  • [154] B. Wang and N. Hegde, “Privacy-preserving Q-learning with functional noise in continuous spaces,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
  • [155] D. Qiao and Y.-X. Wang, “Offline reinforcement learning with differential privacy,” in Proc. Adv. Neural Inf. Process. Syst., vol. 36, 2024.
  • [156] V. Dvorkin, F. Fioretto, P. Van Hentenryck, P. Pinson, and J. Kazempour, “Differentially private optimal power flow for distribution grids,” IEEE Trans. Power Syst., vol. 36, no. 3, pp. 2186–2196, 2020.
  • [157] J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated reinforcement learning: Techniques, applications, and open challenges,” arXiv preprint arXiv:2108.11887, 2021.
  • [158] X. Fan, Y. Ma, Z. Dai, W. Jing, C. Tan, and B. K. H. Low, “Fault-tolerant federated reinforcement learning with theoretical guarantee,” in Proc. Adv. Neural Inf. Process. Syst., vol. 34, 2021, pp. 1007–1021.
  • [159] B. Amos, L. Xu, and J. Z. Kolter, “Input convex neural networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 146–155.
  • [160] Y. Chen, Y. Shi, and B. Zhang, “Optimal control via neural networks: A convex approach,” in Proc. Int. Conf. Learn. Representations, 2018.
  • [161] M. Yu, Z. Yang, M. Kolar, and Z. Wang, “Convergent policy optimization for safe reinforcement learning,” in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019.
  • [162] X. Sun, Z. Xu, J. Qiu, H. Liu, H. Wu, and Y. Tao, “Optimal volt/var control for unbalanced distribution networks with human-in-the-loop deep reinforcement learning,” IEEE Trans. Smart Grid, vol. 15, no. 3, pp. 2639–2651, 2024.
  • [163] L. Yang, Q. Sun, N. Zhang, and Z. Liu, “Optimal energy operation strategy for we-energy of energy internet based on hybrid reinforcement learning with human-in-the-loop,” IEEE Trans. Syst., Man, Cybern.: Syst., vol. 52, no. 1, pp. 32–42, 2022.
  • [164] S. Majumder, L. Dong, F. Doudi, Y. Cai, C. Tian, D. Kalathil, K. Ding, A. A. Thatte, N. Li, and L. Xie, “Exploring the capabilities and limitations of large language models in the electric energy sector,” Joule, 2024, to appear.