Reinforcement Learning for Efficient Design and Control
Co-optimisation of Energy Systems

Marine Cauz    Adrien Bolland    Christophe Ballif    Nicolas Wyrsch
Abstract

The ongoing energy transition drives the development of decentralised renewable energy sources, which are heterogeneous and weather-dependent, complicating their integration into energy systems. This study tackles this issue by introducing a novel reinforcement learning (RL) framework tailored for the co-optimisation of design and control in energy systems. Traditionally, the integration of renewable sources in the energy sector has relied on complex mathematical modelling and sequential processes. By leveraging RL’s model-free capabilities, the framework eliminates the need for explicit system modelling. By optimising both control and design policies jointly, the framework enhances the integration of renewable sources and improves system efficiency. This contribution paves the way for advanced RL applications in energy management, leading to more efficient and effective use of renewable energy sources.

Reinforcement Learning, Energy, Renewable, Optimisation

1 Introduction

1.1 Background and motivation

Energy systems are undergoing significant transformations to meet increasing demands for sustainability and energy efficiency, particularly through the integration of decentralised and intermittent renewable energy sources. Traditionally, these systems are developed in two distinct phases: design, which determines the optimal size of components, and control, which focuses on their optimal operation. This sequential approach, as highlighted by (Dranka et al., 2021), often leads to inefficiencies and missed opportunities for optimal performance. To address the increasing complexity driven by renewable integration, co-optimisation has emerged as a key approach, jointly handling design and control to enhance system reliability and affordability. Recent literature underscores the importance of co-optimising design and operation using techniques such as linear programming (Krishnan et al., 2016; Daadaa et al., 2021; Jayadev et al., 2020), stochastic models (Clack et al., 2015; Qiu et al., 2017), robust optimisation (Popovici & Winston, 2015; Khojasteh, 2020), and evolutionary algorithms (Li et al., 2018; Gjorgiev & Sansavini, 2018; Bao et al., 2019). Among these methods, Mixed-Integer Linear Programming (MILP) is the most commonly used but requires mathematical modelling of the system and its interactions. These methods aim to optimise performance comprehensively while addressing uncertainties and multi-objective challenges. Overall, these diverse approaches highlight both the technical challenges and the critical importance of co-optimisation in enhancing the efficiency and sustainability of energy systems (Sachio et al., 2022; Fazlollahi & Maréchal, 2013; Dranka et al., 2021).

Data-driven methods, such as reinforcement learning (RL), have shown significant potential in computing control policies across various applications, including energy, offering a promising alternative to traditional approaches (François-Lavet et al., 2018; Quest et al., 2022; Perera et al., 2020). However, standard RL methods typically focus solely on operational control without integrating system design, limiting insights into how design changes influence outcomes. Despite its potential, RL is not fully exploited in the energy field (Perera & Kamalaruban, 2021). Recent advancements in RL, particularly gradient-based optimisation techniques like actor-critic methods, facilitate learning control policies for complex problems, opening new opportunities.

Building on these advancements, researchers have proposed algorithms to efficiently tackle joint design and control challenges. In (Schaff et al., 2019), the authors introduced an RL framework that optimises both design and control by maintaining a distribution over designs, using the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017) for policy training and the REINFORCE update rule for design adjustments (Williams, 1992). This approach has been successfully applied in various robotic environments, outperforming other techniques (Bhatia et al., 2022; Ha, 2019). Alternatively, (Luck et al., 2020) enhances adaptability for joint design and control using Soft Actor-Critic (SAC) (Haarnoja et al., 2018), despite involving complex optimisation problems. The algorithm from (Bolland et al., 2022) refines this approach by combining policy gradients with model-based optimisation. It was applied to systems with photovoltaic (PV) panels and a battery (Cauz et al., 2023), though it faces limitations due to finite time horizons and its on-policy nature. Other approaches (Chen et al., 2020; Jackson et al., 2021) focus on learning system parameters directly, assuming the system dynamics are parameterised, but are restrictive when modelling complex energy systems where design decisions are directly related to explicit costs or rewards.

1.2 Contribution

Capitalising on these recent developments in policy gradient techniques, this study advances an integrated RL framework specifically tailored to address the co-optimisation challenges within energy systems. As introduced by (Schaff et al., 2019), the proposed framework employs a parametric design distribution, which is effective for modelling distributions over continuous supports and makes gradient-based methods straightforward to apply. This approach contrasts with most of the previous methods that employ a deterministic representation of the design variable (Chen et al., 2020; Jackson et al., 2021; Bolland et al., 2022), which can make model-free optimisation and efficient exploration challenging. Additionally, this framework differs from (Schaff et al., 2019) by incorporating entropy regularisation, as in (Haarnoja et al., 2018), into the optimisation process to prevent convergence to local optima. Furthermore, this framework relies on a deterministic policy parameterisation, which is optimised using an off-policy actor-critic algorithm, namely Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2019). This allows for accommodating infinite time horizons, addressing a significant gap in methodologies (Bolland et al., 2022; Cauz et al., 2023). Unlike most existing studies, including (Schaff et al., 2019), the control policy training is off-policy, thereby enhancing sample efficiency by learning from a diverse range of past experiences stored in a replay buffer. Finally, this framework is also model-free, eliminating the need for a predefined mathematical model of the system, which simplifies implementation and broadens its applicability. None of the previously cited methods combine all these features.

By integrating these capabilities, this approach maximises the potential of RL to address the co-optimisation of design and operation within energy systems, a challenge often overlooked in RL research. This integrated framework bridges the gap between theoretical RL research and its practical application in energy systems, establishing a new benchmark for employing RL to tackle co-optimisation challenges in the energy sector.

The paper is structured as follows: Section 2 details the proposed RL method, covering both control and design aspects. Section 3 describes the energy system and experimental setup. Section 4 presents the findings, with Section 5 discussing their implications and potential impact. Finally, Section 6 summarises the key insights and contributions of the research.

2 Method

This section outlines the conventional RL approach for system control and then details the adaptations made to enable learning system designs.

2.1 Control Policy

Formally, RL is conceptualised as an interplay between an agent and an environment. This environment is mathematically formalised as a Markov Decision Process (MDP) (Bellman, 1957), which is defined by its model $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, p_0, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$ denotes the transition function (i.e., $T(s_{t+1}|s_t,a_t)$ denotes the probability of reaching a state $s_{t+1}$ when taking an action $a_t$ from state $s_t$), $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ denotes the reward function (i.e., $R(s_t,a_t)$ is the immediate reward received by taking action $a_t$ from state $s_t$), $p_0: \mathcal{S} \rightarrow [0,1]$ denotes the initial state distribution, and $\gamma \in [0,1)$ denotes the discount factor (i.e., $\gamma$ models the importance of future rewards, with a lower value placing more emphasis on immediate rewards). Within the MDP framework, the agent’s objective is to find a policy, $\pi \in \Pi$, namely a conditional distribution over actions that can be used to take actions in each state by sampling. The optimal policy, denoted $\pi^*$, maximises the expected cumulative discounted reward, called the expected return of the policy, $\mathbb{E}\left[\mathcal{R}_t\right]$, where $\mathcal{R}_t = \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$.
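As a concrete illustration of the return defined above, the following minimal sketch computes the discounted sum for a finite list of rewards, which is how the infinite sum is truncated in practice. The function name `discounted_return` is introduced here for illustration and does not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite list of rewards."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Iterate backwards so that each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma` close to 1 the later rewards weigh almost as much as the first one, whereas a small `gamma` makes the sum dominated by the first few steps.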

Actor-critic algorithms combine policy gradient and value-based methods for efficient policy learning and evaluation (François-Lavet et al., 2018). The actor proposes actions based on a policy $\pi$ modelled by a neural network with parameters $\theta$, while the critic evaluates these actions by estimating value functions. This mechanism allows for ongoing refinement of the policy based on the critic’s feedback and updating the critic as the policy changes. Among the various actor-critic implementations, DDPG (Silver et al., 2014; Lillicrap et al., 2019) stands out due to its off-policy nature, meaning the policy can be improved using trajectories where actions are taken from another policy, and is suitable for environments with continuous action spaces. The critic approximates the state-action value function $Q^{\theta}(s,a)$, which is used to compute the policy update gradients. To ensure stable learning, DDPG employs target networks to compute the temporal-difference learning targets and adds Gaussian noise to policy outputs for sufficient exploration.
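The stabilisation and exploration mechanisms mentioned above can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: the soft (Polyak) update with rate `tau` follows the usual DDPG formulation, while the clipping of noisy actions to `[low, high]`, the `done` flag, and all function names are assumptions made here.

```python
import random


def td_target(reward, gamma, next_q, done=False):
    """One-step temporal-difference target y = r + gamma * Q'(s', pi'(s')).

    `next_q` is the target critic evaluated at the target actor's action.
    """
    return reward if done else reward + gamma * next_q


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def noisy_action(action, sigma, rng, low=-1.0, high=1.0):
    """Add zero-mean Gaussian noise to each action component, then clip (assumed)."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]
```

A small `tau` makes the target networks change slowly, which keeps the regression targets of the critic from moving too quickly between updates.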

2.2 Design Policy

Conventional RL typically focuses on optimising a control policy $\pi^*_\theta$ for a fixed system design. Building on this primary objective, this study explores both the design space $X$ and control strategies to identify an optimal system design $x^*$ and its corresponding control policy $\pi^*_\theta(a_t|s_t,x^*)$. Each design $x \in X$ corresponds to a different MDP, as defined in Subsection 2.1. The objective is to maximise the expected return over a design distribution, effectively co-optimising design and control to enhance overall system performance. The proposed RL framework extends the traditional control policy optimisation by incorporating a probability distribution $p_\phi(x)$ over potential designs $x \in X$. The learnable parameters $\phi$ represent the parameters of this design distribution. The ultimate goal is to find the optimal parameters $\phi^*$ and $\theta^*$ that jointly maximise the expected discounted reward:

$$\phi^*, \theta^* = \operatorname*{arg\,max}_{\phi,\theta} \mathop{\mathbb{E}}_{x \sim p_\phi(\cdot)} \left[ \mathop{\mathbb{E}}_{\begin{subarray}{c} s_0 \sim p_0(\cdot) \\ a_t \sim \pi_\theta(\cdot|s_t,x) \\ s_{t+1} \sim T(\cdot|s_t,a_t) \end{subarray}} \left[ \mathcal{R}_t \right] \right] \tag{1}$$
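The nested expectation in Equation (1) can be approximated by plain Monte Carlo sampling: an outer average over sampled designs and an inner average over rollouts for each design. The sketch below only illustrates this structure; `sample_design` and `rollout_return` are hypothetical callables standing in for the design distribution and for one simulated episode under the current policy.

```python
def estimate_objective(sample_design, rollout_return, n_designs, n_rollouts):
    """Monte Carlo estimate of E_{x ~ p_phi}[ E[return | x] ]."""
    total = 0.0
    for _ in range(n_designs):
        x = sample_design()  # outer expectation: draw a design
        # Inner expectation: average the return of several rollouts for this design.
        total += sum(rollout_return(x) for _ in range(n_rollouts)) / n_rollouts
    return total / n_designs
```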

The co-optimisation framework is designed to maximise the expected discounted reward by effectively integrating system design and control. It is compatible with any standard RL algorithm; however, this implementation specifically uses the DDPG algorithm. This algorithm adapts the control policy $\pi_\theta$ to maximise expected returns across a range of designs drawn from the design probability distribution $p_\phi$. Each training iteration consists of two concurrent processes:

  • The control policy $\pi_\theta$ is refined using gradient ascent to enhance reward expectations over the sampled designs.

  • Simultaneously, the design distribution $p_\phi$ is updated to increase the likelihood of designs that yield higher performance under the current policy.

Algorithm 1 Co-optimisation of design and control
  Initialise actor $\pi_\theta(s,x)$ and critic $Q_\theta(s,a,x)$
  Initialise target networks and replay buffer (capacity $N$)
  Initialise design distribution $p_\phi(x)$
  repeat
     Sample designs $\{x_1, \ldots, x_d\}$ from $p_\phi(x)$
     Compute expected return $R_i$ for each $x_i$ with DDPG
     Update critic by minimising the loss:
     $L(\theta^Q) = \frac{1}{N} \sum_{n=0}^{N-1} \left( y_n - Q_\theta(s_n, a_n, x_n) \right)^2$
     Update actor by one step of gradient ascent:
     $\nabla_{\theta^\pi} J(\theta^Q, \theta^\pi) \approx \frac{1}{N} \sum_{n=0}^{N-1} \nabla_{\theta^\pi} Q_\theta(s_n, \pi_\theta(s_n, x_n), x_n)$
     Update target networks
     Compute the loss function for the design update:
     $\mathcal{L}(\phi) = -\frac{1}{d} \sum_{i=1}^{d} \left( \log p_\phi(x_i) \cdot R_{i,t} - \lambda \cdot \log p_\phi(x_i) \right)$
     Update $p_\phi$ by minimising the loss with respect to $\phi$
  until End of training

Algorithm 1 describes the co-optimisation procedure. It starts with the initialisation of the design distribution, fostering a wide-ranging exploration of designs. During training, the framework adjusts the policy parameters $\theta$ and the design parameters $\phi$ to gradually phase out less effective designs, allowing the policy to specialise and focus on a narrowing set of promising designs. As a result, the variance within the design distribution $p_\phi$ decreases, guiding the system towards convergence on an optimal design $x^*$ and associated policy $\pi^*_\theta$, thereby maximising the overall system performance.
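The return-weighted term of the design loss can be related to the score-function (REINFORCE) identity. The short derivation below is a standard result and is not taken from the paper; $J(\phi)$ and $R(x)$ are shorthand introduced here for the expected return under $p_\phi$ and for the return obtained with design $x$ under the current policy, which is treated as independent of $\phi$.

```latex
\begin{align}
J(\phi) &= \mathbb{E}_{x \sim p_\phi}\left[ R(x) \right]
         = \int p_\phi(x)\, R(x)\, \mathrm{d}x, \\
\nabla_\phi J(\phi)
        &= \int \nabla_\phi p_\phi(x)\, R(x)\, \mathrm{d}x
         = \int p_\phi(x)\, \nabla_\phi \log p_\phi(x)\, R(x)\, \mathrm{d}x \\
        &= \mathbb{E}_{x \sim p_\phi}\left[ R(x)\, \nabla_\phi \log p_\phi(x) \right]
         \approx \frac{1}{d} \sum_{i=1}^{d} R_i\, \nabla_\phi \log p_\phi(x_i).
\end{align}
```

Under this reading, a gradient step that decreases $-\frac{1}{d}\sum_{i} \log p_\phi(x_i) \cdot R_i$, with the returns held constant, is a stochastic gradient ascent step on $J(\phi)$.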

In comparison with the framework proposed by (Schaff et al., 2019), two notable modifications in the design distribution enhance its suitability for energy systems. Firstly, instead of using a Gaussian Mixture Model to parameterise the design distribution $p_\phi$, which may require clipping to ensure physical feasibility, this framework employs a log-normal mixture model. This model inherently restricts the design space to $X = \mathbb{R}^+$, ensuring all design values remain within physically feasible limits for energy systems. The mixture model parameters, including the mean and variance of each log-normal component and their respective (unscaled) weights, are updated using stochastic gradient ascent based on the REINFORCE gradient estimates (Williams, 1992). The second modification introduces entropy regularisation to the design distribution to mitigate the risk of local optima, a common challenge in energy system optimisations as noted in (Cauz et al., 2023). Initially, the design distribution is set with random means and high variance to encourage diverse exploration. Additionally, an entropy term is added to the loss function (Ahmed et al., 2019), with a weight that gradually decreases to reduce exploration over time. There is no closed-form expression for the entropy of a log-normal mixture model, hence the entropy is estimated by extending the return with the log probability of the design samples, bypassing the need for computationally intensive methods.
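To make the log-normal mixture and the design loss more tangible, here is a minimal, standard-library sketch for a single (one-dimensional) design variable. It is an illustration under stated assumptions, not the authors' code: the unscaled weights are turned into probabilities with a softmax, `mus` and `sigmas` are the log-space mean and standard deviation of each component, and all function names are hypothetical. `design_loss` evaluates the loss of Algorithm 1 for given log-probabilities and returns; in practice its gradient with respect to $\phi$ would come from an automatic differentiation library.

```python
import math
import random


def lognormal_logpdf(x, mu, sigma):
    """Log-density of a log-normal with log-space mean mu and std sigma (x > 0)."""
    z = (math.log(x) - mu) / sigma
    return -math.log(x * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def mixture_logpdf(x, logits, mus, sigmas):
    """Log-density of a log-normal mixture whose weights are softmax(logits)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    terms = [l - log_norm + lognormal_logpdf(x, mu, s)
             for l, mu, s in zip(logits, mus, sigmas)]
    t = max(terms)  # log-sum-exp for numerical stability
    return t + math.log(sum(math.exp(v - t) for v in terms))


def sample_mixture(rng, logits, mus, sigmas):
    """Draw one strictly positive design value from the mixture."""
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    k = rng.choices(range(len(logits)), weights=weights)[0]
    return rng.lognormvariate(mus[k], sigmas[k])


def design_loss(log_probs, returns, lam):
    """Value of -(1/d) * sum_i (log p(x_i) * R_i - lam * log p(x_i))."""
    d = len(log_probs)
    return -sum(lp * r - lam * lp for lp, r in zip(log_probs, returns)) / d
```

Because every sample is the exponential of a Gaussian draw, the sampled designs are positive by construction, so no clipping is needed.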

3 Experiments

This section describes the experimental set up to evaluate the proposed framework on a building-scale PV-battery system. The aim is to minimise total electricity costs by optimising both the investment in system components, namely the design parameters, and the operational strategies for storage management, meaning the control policy. Operational costs are derived from grid interactions required to meet building energy demands. Performance is measured against the average expected return of the system’s total cost, reflecting the economic impact of chosen design and control strategies. For comparative analysis, the RL co-optimisation is benchmarked against traditional approaches highlighted in Section 1. First, it is compared to a MILP approach for selecting the best design, followed by an RL technique for determining the optimal policy for this design. Second, it is compared to expert rule-based controllers.

3.1 Building-Scale System

The system is a building-scale energy system within an office setting, equipped with a PV installation and a stationary lithium-ion battery to satisfy its electricity requirements. The system also features a bidirectional electric vehicle (EV) charging point, whose usage is stochastically modelled based on typical patterns. Moreover, the building is connected to the electrical grid, subject to dynamically varying electricity prices. The main objective is to determine the optimal design for the PV installation and the battery capacity, while simultaneously developing an optimal control policy for battery and EV management. This aims to minimise the total cost of ownership, encompassing both capital and operational expenses, as well as grid costs. The design and control model of this system is formulated as an MDP, which is described in detail in Appendix A.

The model is trained on a historical one-year dataset of normalised PV production and electrical consumption, divided into training and validation sets to capture seasonal fluctuations. This dataset is supplemented with synthetic data for dynamic grid tariffs and EV arrival times. Details of both datasets are provided in Appendix A. The MDP’s time horizon is truncated after $T=168$ hours (one week), with long-term dependencies captured via bootstrapping in the critic training. Ideally, the time horizon would cover an entire year or the system’s lifecycle to capture seasonal production and consumption variations and potential equipment degradation. Performance is regularly evaluated in two ways: (i) across the full training dataset, corresponding to $T=8088$ hours, to assess long-term effectiveness, and (ii) across the full validation dataset, corresponding to $T=672$ hours, to check for overfitting.
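The horizons quoted above fit together arithmetically: 8088 training hours plus 672 validation hours give 8760 hours, i.e. one non-leap year, and 672 hours is four weeks. The sketch below shows one possible way to draw a one-week training episode from the training set. Uniform sampling of the start hour is an assumption made here, since the paper does not state how episode start times are chosen, and the names are hypothetical.

```python
import random

TRAIN_HOURS = 8088    # length of the training set, as reported in the text
VALID_HOURS = 672     # length of the validation set, as reported in the text
EPISODE_HOURS = 168   # truncated MDP horizon (one week)


def sample_episode_window(rng, dataset_hours=TRAIN_HOURS, episode_hours=EPISODE_HOURS):
    """Return (start, end) hour indices of a random contiguous window (end exclusive)."""
    start = rng.randrange(dataset_hours - episode_hours + 1)
    return start, start + episode_hours
```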

3.1.1 Experiment setup

The actor $\pi_\theta$ and critic $Q_\theta$ are both implemented using neural networks, each consisting of two hidden layers with 256 neurons and ReLU activation functions. For the actor network, a tanh activation function is applied in the final layer to map the output to the action space. The critic network concatenates the state and action at the input layer, with a linear activation function in the output layer. To facilitate integration with existing RL libraries, the design parameters $x$ are appended to the state variables before they are fed into the control network.
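Two of the steps described above are simple enough to sketch without a deep learning library: appending the design to the state, and mapping a tanh output to a bounded action. The affine rescaling to `[low, high]` is an assumption about how the tanh output is mapped to the action space, and both function names are hypothetical.

```python
import math


def augment_state(state, design):
    """Append the design parameters to the state vector fed to actor and critic."""
    return list(state) + list(design)


def scale_action(pre_activation, low, high):
    """Squash with tanh to [-1, 1], then map affinely to [low, high] (assumed)."""
    squashed = math.tanh(pre_activation)
    return low + 0.5 * (squashed + 1.0) * (high - low)
```

Treating the design as extra state features means an off-the-shelf DDPG implementation can be reused unchanged, since it only sees a slightly larger observation vector.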

The design distribution, $p_\phi(x)$, is modelled as a log-normal mixture with three components, each parameterised with two design parameters. The means, variances and weights of each component are, respectively, initialised randomly within the interval $[0, 1)$, set high, and uniformly distributed, to ensure the distribution covers a large range of $\mathbb{R}^+$. The entropy weight linearly decreases throughout training and reaches zero during the last half of iterations.
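A linear decay of the entropy weight can be written as a one-line schedule. The paper does not give the initial weight nor the exact iteration at which the weight reaches zero, so both are left as hypothetical parameters (`initial_weight`, `zero_at`) in this sketch.

```python
def entropy_weight(iteration, initial_weight, zero_at):
    """Linearly decay from initial_weight at iteration 0 to 0 at iteration zero_at."""
    if zero_at <= 0:
        raise ValueError("zero_at must be positive")
    return initial_weight * max(0.0, 1.0 - iteration / zero_at)
```

After `zero_at` the weight stays at zero, so the design update is driven by the returns alone for the remaining iterations.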

The framework employs the DDPG algorithm to train the control policy $\pi_\theta$. Each iteration consists of a batch of 32 episodes, each lasting $T=168$ hours, i.e., one week. Additionally, at every iteration, a set of design parameters is sampled and evaluated with the current policy across the full training dataset to monitor performance over a duration close to one year, i.e., $T=8088$ hours. To prevent gradient explosion during training, gradient clipping is implemented. Moreover, performance is evaluated at every iteration, again with a batch of 32 episodes, across the full validation dataset, i.e., $T=672$ hours, with the current design distribution and control policy. Finally, the medians and quartiles of the design parameter distribution are computed at the end of training, after 500 iterations. To ensure reliability and account for variability in initialisation, all experiments are conducted using 30 different seed values ranging from 0 to 30.
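The text does not say which form of gradient clipping is used. A common choice, shown here purely as an assumption, is to rescale the whole gradient vector when its Euclidean norm exceeds a threshold; `clip_by_global_norm` and `max_norm` are hypothetical names.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so that their Euclidean norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)  # already within the limit: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Rescaling the full vector preserves the direction of the update and only limits its length.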

3.1.2 Rule-based baseline

A rule-based baseline is established as a fixed control policy, focusing solely on optimising the design. This setup allows for a direct comparison between joint optimisation using DDPG and simple design optimisation under a given expert policy. The rule-based controller discharges the stationary battery when consumption exceeds PV production and charges it when production is higher than consumption. The bidirectional EV’s battery, when available, follows the same logic to augment the system’s capacity. This rule-based controller operates within the same MDP environment but does not require a training phase, as it involves no trainable control parameters $\theta$. Performance evaluations of the system’s design under this controller are conducted over 500 iterations, focusing exclusively on updating the design parameters $\phi$, given that the control policy is static and predetermined. This experiment is referred to as the design-only scenario, as only the design parameters are trained, without assuming perfect foresight.
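The charging rule described above can be sketched as a small function returning the battery power for one time step. This is a simplified illustration: a lossless battery, a symmetric power limit, and the parameter names (`soc_kwh`, `max_power_kw`, `dt_h`) are assumptions made here, and the actual environment is the one specified in Appendix A. The same function could be applied to the EV battery whenever the vehicle is plugged in.

```python
def rule_based_battery_power(pv_kw, load_kw, soc_kwh, capacity_kwh,
                             max_power_kw, dt_h=1.0):
    """Battery power in kW for one step: positive charges, negative discharges."""
    surplus = pv_kw - load_kw
    if surplus > 0:
        # Production exceeds consumption: store the surplus, limited by
        # the power rating and the remaining storage headroom.
        headroom_kw = (capacity_kwh - soc_kwh) / dt_h
        return min(surplus, max_power_kw, headroom_kw)
    if surplus < 0:
        # Consumption exceeds production: cover the deficit, limited by
        # the power rating and the energy currently stored.
        available_kw = soc_kwh / dt_h
        return -min(-surplus, max_power_kw, available_kw)
    return 0.0
```

Any deficit the battery cannot cover, or surplus it cannot absorb, would be exchanged with the grid.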

3.1.3 MILP baseline

MILP is the most widely used tool for designing energy systems (Dranka et al., 2021; Perera & Kamalaruban, 2021). This approach requires mathematical modelling of the system and its interactions, assuming a perfect foresight approach. In this study, an environment formulated as a mathematical program consisting of constraints and objectives is developed, similar to the MDP presented in Appendix A. This formulation allows for computing the optimal design, which is then controlled by a policy learned using DDPG for this particular design. This methodology enables benchmarking the proposed co-optimisation framework against a two-step baseline where the design is initially computed using MILP over the full training dataset, and subsequently controlled with DDPG. This experiment is referred to as the best two-step scenario because it involves a co-optimisation using the best response two-step algorithm.

Moreover, a final scenario, referred to as the fixed scenario, is computed based on the rule-based control. The rule-based policy is implemented as constraints in the MILP (actions are constrained based on the state), allowing for the computation of the optimal design for this fixed control policy using MILP. This experiment provides a static performance, assuming fixed design and control. Since there are no training parameters, this scenario is computed to verify that the solutions provided exceed this fixed baseline.

4 Results

This section details the performance of the proposed framework in co-optimising the design and operation of a building-scale PV-battery system.

4.1 Training Dynamics



Figure 1: Training performance, over 500 iterations, for the co-optimisation (blue), best two-step (orange), and design-only (green) scenarios. Experiments were conducted using seed values ranging from 0 to 30, with the figure showing the median and quartiles. The top subplot illustrates the evolution of average expected returns over $T=168$ hours, i.e., the effective training. The bottom subplot assesses the average expected return throughout the full training dataset over $T=8088$ hours, i.e., the long-term performance.

Figure 1 tracks the performance over 500 iterations during training of (i) the co-optimisation using DDPG (blue), (ii) the best two-step optimisation using DDPG for control with a fixed design derived from the MILP (orange), (iii) the design-only scenario using a rule-based control policy while optimising the design distribution (green), and (iv) the fixed scenario corresponding to the solution provided first by the MILP design computed with the rule-based constraints and then by applying the rule-based control policy (black). For all scenarios, 30 experiments are conducted with different seeds ranging from 0 to 30. Figure 1 reports the median and quartiles of the return during each learning procedure. In the best two-step scenario, the design parameters resulting from the MILP are fixed at $6$ kWp for PV and $14$ kWh for battery capacity. In the fixed scenario, the design parameters resulting from the MILP are fixed at $3$ kWp for PV and $14$ kWh for battery capacity, indicating that the integration of the rule-based constraint within the MILP constraints reduces the optimal PV power by half.
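The median and quartiles across seeds can be obtained with the Python standard library, as sketched below for the returns collected at one iteration. The interpolation method is an assumption (the paper does not specify one), and `median_and_quartiles` is a hypothetical helper name.

```python
import statistics


def median_and_quartiles(values):
    """Return (first quartile, median, third quartile) of per-seed values."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3
```

Calling this helper once per iteration on the per-seed returns gives the central curve and the shaded band of a plot such as Figure 1.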

The top subplot of Figure 1 illustrates the weekly average expected returns, computed from designs sampled from the current design distribution in the co-optimisation and design-only scenarios, across batches of 32 episodes, each lasting $T = 168$ hours. For these two scenarios, design parameters $\phi$ are updated during training and then the weekly average expected return stabilises by 500 iterations at mean values of $-41.7$ and $-82.9$ at the last iteration, respectively, as reported in Table 1. In the best two-step scenario, due to its static design, training converges faster, with results stabilising around $-44.8$. In the fixed scenario, since the design was previously computed using MILP and the control policy is predefined, there is no further optimisation, and it converges to $-85.9$. The variations are linked to the samples in the initial states. The bottom subplot of Figure 1 evaluates long-term performance over a batch of 32 episodes, each with a duration of the entire training dataset, $T = 8088$ hours. The difference in performance between the co-optimisation and the best two-step scenario remains similar, with results converging to $-49.8$ and $-54.8$, respectively. In the design-only and fixed scenarios, the results slightly decrease compared to the weekly results, converging to $-101.1$ and $-104.3$, respectively. This assessment confirms that the co-optimisation maintains performance over extended operational periods, which is essential for the infinite horizon characteristic of energy systems.

Table 1: Average expected returns and standard deviation at the last iteration over the 30 seed experiments for training, long-term performance, and validation in the co-optimisation, best two-step, design-only, and fixed scenarios.

| Scenario | Training ($T = 168$) | Long-term ($T = 8088$) | Validation ($T = 672$) |
|---|---|---|---|
| Co-optim. | -41.7 ± 3.2 | -49.8 ± 4.9 | -50.1 ± 2.1 |
| Best 2-step | -44.8 ± 4.4 | -54.8 ± 4.1 | -54.5 ± 0.0 |
| Design-only | -82.9 ± 6.5 | -101.1 ± 7.9 | -97.1 ± 7.4 |
| Fixed | -85.9 ± 0.0 | -104.3 ± 0.0 | -99.6 ± 0.0 |
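
As a reading aid for Table 1, the relative gap between the co-optimisation and best two-step scenarios can be computed directly from the reported means. The sketch below is illustrative only: the helper name `relative_gap` is hypothetical, the numbers are the mean returns copied from Table 1, and the standard deviations are ignored.

```python
# Mean returns copied from Table 1 (higher, i.e. less negative, is better).
table_1 = {
    "training (T=168)": {"co_optim": -41.7, "two_step": -44.8},
    "long-term (T=8088)": {"co_optim": -49.8, "two_step": -54.8},
    "validation (T=672)": {"co_optim": -50.1, "two_step": -54.5},
}


def relative_gap(candidate: float, baseline: float) -> float:
    """Improvement of `candidate` over `baseline`, as a fraction of |baseline|."""
    return (candidate - baseline) / abs(baseline)


gaps = {
    name: relative_gap(row["co_optim"], row["two_step"])
    for name, row in table_1.items()
}

for name, gap in gaps.items():
    print(f"{name}: co-optimisation improves on best two-step by {gap:.1%}")
```

With these means, the gap is roughly 6.9% in training, 9.1% on the long-term evaluation, and 8.1% in validation.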

4.2 Evaluation Process



Figure 2: Validation performances, over 500 iterations, for the co-optimisation (blue), best two-step (orange), design-only (green), and fixed (black) scenarios. Experiments were conducted using seed values ranging from 0 to 30. The figure shows the median and quartiles of the average expected return computed over the entire validation dataset, i.e., $T = 672$.

The validation process, illustrated in Figure 2, involves computing the average expected return of the current control policy and designs sampled from the current design distribution every iteration over the full validation dataset, i.e., $T = 672$ hours. The validation performance of the co-optimisation scenario (blue) is benchmarked against the best two-step scenario (orange), the design-only scenario (green), and the fixed scenario (black). The experiments are conducted using 30 different seed values, with the median and quartiles reported in Figure 2. All scenarios quickly converge to a unique solution for this specific validation episode over $T = 672$ hours. Interestingly, the difference in performance between the co-optimisation and best two-step scenarios is greater than during the training process. This might result from the perfect foresight approach in the MILP parameter selection, which allows for selection based on future information during learning that is unknown at the time of evaluation. As reported in Table 1, the scenarios converge to the following average values at the last iteration: $-50.1$ for co-optimisation, $-54.5$ for best two-step, $-97.1$ for design-only, and $-99.6$ for the fixed scenario.

Figure 3 represents the final distribution (estimated with 1000 samples) after 500 training iterations of one of the 30 seed experiments, for both the co-optimisation scenario and the design-only scenario. The median and quartiles are used to highlight the narrow confidence interval of the parameter distribution within the design space $X = \mathbb{R}^{+}$. This illustration effectively shows that both scenarios converge to similar optimal design parameter intervals. For co-optimisation with DDPG (blue), the interval between the first and third quartile is $[3, 6]$ kWp for PV and $[5, 10]$ kWh for battery capacity. For the design-only scenario with the rule-based control policy (green), the interval between the first and third quartile is $[2, 5]$ kWp for PV and $[0, 1]$ kWh for battery capacity. The mean design parameters over all 30 seed experiments are reported in Table 2. Note that these design parameter values are consistent with the assumptions of the building-scale system environment. Additionally, they differ from those computed using MILP, which in the best two-step scenario are 6 kWp for PV and 14 kWh for battery capacity, reflecting different optimisation dynamics.
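
The quartile intervals quoted above come from sampling the final design distribution. A minimal sketch of that post-processing step follows, under loud assumptions: the design distribution is taken to be an independent Gaussian per parameter, clipped to $X = \mathbb{R}^{+}$, and the means and standard deviations are placeholders rather than the learned values. The function names are hypothetical.

```python
import random
import statistics


def sample_designs(mean, std, n_samples=1000, seed=0):
    """Draw designs from a Gaussian (an assumed family) and clip them to R+."""
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(mean, std)) for _ in range(n_samples)]


def interquartile_interval(samples):
    """Return (first quartile, median, third quartile) of the samples."""
    q1, q2, q3 = statistics.quantiles(samples, n=4)
    return q1, q2, q3


# Placeholder distribution parameters, chosen only for illustration.
pv_samples = sample_designs(mean=5.0, std=2.0)        # kWp
battery_samples = sample_designs(mean=7.0, std=3.0)   # kWh

for label, samples in (("PV [kWp]", pv_samples), ("Battery [kWh]", battery_samples)):
    q1, q2, q3 = interquartile_interval(samples)
    print(f"{label}: Q1={q1:.2f}, median={q2:.2f}, Q3={q3:.2f}")
```

The same three numbers per parameter are what a boxplot such as the one in Figure 3 displays as the box edges and the central line.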



Figure 3: Design parameter distribution after training for the co-optimisation (top, blue) and design-only (bottom, green) scenarios. The boxplots are computed based on a sample of 1000 designs drawn from the final design distribution of one of the 30 seed experiments.
Table 2: Mean of the design distribution at the last iteration, averaged over the 30 seed experiments in the co-optimisation and design-only scenarios. For the best two-step and fixed scenarios, the reported values are those computed using MILP.

| Design parameter | PV (kWp) | Battery (kWh) |
|---|---|---|
| Co-optim. | 6.6 ± 1.4 | 4.5 ± 0.9 |
| Best 2-step | 6 | 14 |
| Design-only | 6.3 ± 5.3 | 3.5 ± 12.2 |
| Fixed | 3 | 14 |

5 Discussion

This study investigates the co-optimisation of design and operation in energy systems using a novel RL framework. The primary goal was to assess the feasibility and effectiveness of RL in developing integrated design strategies within a co-optimisation framework, aiming to enhance system performance by minimising total electricity costs. The results confirm that the framework successfully converges to high-performing design parameters while achieving superior control performance in both short and long-term periods. The co-optimisation scenario outperforms both the design-only and best two-step scenarios in training and validation performances, while converging to different design parameter values.

First, the two RL-based design optimisations, i.e., the co-optimisation and design-only scenarios, converged to different design parameters while achieving significantly different operational performances, underscoring the significance of co-optimisation. The convergence to optimal design parameters in both scenarios is evidenced by the narrowing of the boxplot charts, indicating a non-dispersed solution. The optimised design parameters, although modest, align realistically with the environmental conditions and model assumptions. Additionally, they should be considered in relation to electricity requirements (i.e., 2.5 kWh on average per hour). These results highlight the framework’s capacity to provide precise solutions.

The choice of DDPG among actor-critic algorithms was motivated by its off-policy nature and suitability for environments with continuous action spaces. These characteristics make DDPG significantly more sample-efficient, an advantage in the energy sector where system designs are typically based on data from a single year or, at best, a few years. The fast convergence of the control parameters further confirms the suitability of this algorithm for energy applications.

The main limitation of the proposed framework is that, unlike the MILP, it cannot guarantee an optimal design. The results are sensitive to hyperparameters, so the training evolution must be analysed in detail to ensure convergence to an optimal solution. This becomes even more complex when different algorithms are used and converge to different solutions. Additionally, this study examines only two design parameters. Scaling up the method to include additional design parameters presents two main challenges. First, it becomes harder to sample interesting regions of the design space for all parameters, likely requiring more iterations. Second, it results in higher variance in the gradient estimates. This is analogous to problems where the optimal control policy is learned.
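
The variance issue can be illustrated on a toy problem. The sketch below assumes a score-function (REINFORCE-style, Williams, 1992) gradient estimator for the mean of a Gaussian design distribution; this section does not state that it is exactly the estimator used by the framework. The return function, the function names, and all numerical settings are hypothetical.

```python
import random
import statistics


def score_function_gradient(mu, sigma, rng):
    """One REINFORCE-style sample of d/d(mu) E[f(x)] for x ~ N(mu, sigma^2 I)."""
    x = [rng.gauss(m, sigma) for m in mu]
    # Toy return (hypothetical): negative squared distance to the origin.
    ret = -sum(v * v for v in x)
    # Gradient of log N(x; mu, sigma^2 I) with respect to mu is (x - mu) / sigma^2.
    return [ret * (v - m) / sigma**2 for v, m in zip(x, mu)]


def total_gradient_variance(dim, n_samples=5000, seed=0):
    """Sum over coordinates of the empirical variance of the gradient samples."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    grads = [score_function_gradient(mu, 1.0, rng) for _ in range(n_samples)]
    return sum(statistics.variance([g[i] for g in grads]) for i in range(dim))


for dim in (2, 4, 8):
    variance = total_gradient_variance(dim)
    print(f"{dim} design parameters: total gradient variance ~ {variance:.1f}")
```

For this particular toy return the total variance equals $d(d+2)(d+4)$ analytically, i.e. 48, 192 and 960 for $d = 2, 4, 8$, so going from two to eight design parameters multiplies the variance by twenty. Real energy-system returns will behave differently, but the trend motivates the need for more samples or variance-reduction techniques as the design space grows.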

Finally, two important advantages from the energy perspective are: first, the framework provides an interval of optimal design values rather than a unique solution, as MILP typically does, offering more flexibility and sensitivity information. Second, this framework offers better performance without assuming perfect foresight, likely explaining the superior validation performance in Figure 2, as the MILP did not have access to the validation dataset while computing the design parameters.

6 Conclusion

The primary achievement of this study has been leveraging theoretical advances in RL to bridge the gap with practical energy challenges, focusing on the co-optimisation of design and operation within energy systems. This work has harnessed recent developments in policy gradient techniques to introduce an integrated, off-policy, and model-free RL framework tailored to tackle the co-optimisation challenge in energy systems.

The successful demonstration of RL’s feasibility and effectiveness in developing integrated design strategies within a co-optimisation framework paves the way for future research and expands the capabilities of RL in the energy sector. This conclusion aligns with two notable reviews: (Dranka et al., 2021), which underscores the importance of addressing co-optimisation in energy and highlights the absence of integrated solutions, and (Perera & Kamalaruban, 2021), which notes that RL is not fully exploited in energy and suggests that using RL for design would be a promising new research area.

The outcomes validate the relevance of using RL to design energy systems, demonstrating how co-optimisation can effectively compute control and design policies jointly, and surpass traditional approaches. Additionally, this framework neither mandates a specific control algorithm nor is restricted to RL alone; it only requires the problem to be formulated as an MDP. Adherence to RL standards, such as the Gymnasium library (Towers et al., 2023), is advised to ensure seamless integration with existing control algorithms, even though they have been developed from scratch in this case.
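
To make the interface requirement concrete, the following sketch shows a toy MDP that follows the Gymnasium `reset`/`step` calling convention, with the design parameters passed at construction time. It is not the authors' environment: the class name, the PV profile, the flat 2.5 kW load, the price, and the lossless battery are all placeholder assumptions. To keep the block runnable with the standard library only, it does not import Gymnasium; in practice one would subclass `gymnasium.Env` and declare `observation_space` and `action_space`.

```python
import random


class ToyBatteryEnv:
    """Toy MDP exposing the Gymnasium-style reset/step interface.

    Not the authors' environment: dynamics, prices and profiles are placeholders.
    """

    def __init__(self, pv_kwp, battery_kwh, horizon=168):
        self.pv_kwp = pv_kwp            # design parameter: PV nominal power
        self.battery_kwh = battery_kwh  # design parameter: battery capacity
        self.horizon = horizon          # episode length in hourly steps
        self.rng = random.Random()
        self.t = 0
        self.soc = 0.0

    def _observation(self):
        return (self.t % 24, self.soc)

    def reset(self, *, seed=None, options=None):
        if seed is not None:
            self.rng.seed(seed)
        self.t = 0
        self.soc = self.rng.uniform(0.0, self.battery_kwh)
        return self._observation(), {}

    def step(self, action):
        """`action` is the requested battery power in kW (positive = charge)."""
        hour = self.t % 24
        # Placeholder profiles: triangular daytime PV output and a flat load.
        pv = self.pv_kwp * max(0.0, 1.0 - abs(hour - 12) / 6.0)
        load = 2.5
        # Clip the action so the state of charge stays within [0, capacity].
        charge = min(max(action, -self.soc), self.battery_kwh - self.soc)
        self.soc = min(max(self.soc + charge, 0.0), self.battery_kwh)
        grid_import = max(0.0, load + charge - pv)
        reward = -0.25 * grid_import  # placeholder price in CHF/kWh
        self.t += 1
        terminated = False
        truncated = self.t >= self.horizon
        return self._observation(), reward, terminated, truncated, {}


env = ToyBatteryEnv(pv_kwp=6.0, battery_kwh=14.0, horizon=24)
obs, info = env.reset(seed=0)
total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(0.0)
    total += reward
    done = terminated or truncated
print(f"return of the do-nothing policy over one day: {total:.2f}")
```

Because the design enters only through the constructor, an outer loop can sample a design, build the environment, and hand it to any control algorithm that speaks this interface.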

The practical application reveals the framework’s potential through the analysis of a single year of data. For greater accuracy and to evaluate long-term co-optimisation effects, it is advantageous to extend the dataset to encompass multiple years. Expanding the dataset would enhance the framework’s ability to manage annual fluctuations in energy supply and demand. Further complexity could be introduced into the energy system model, like integrating multiple electric vehicles and accounting for non-linear heat pump dynamics. Additionally, incorporating more complex energy system dynamics such as real-time pricing or demand-response capabilities could improve the model’s precision and relevance. The framework has also demonstrated promising outcomes that suggest the potential for generalisation to enhance sim-to-real transfer (Peng et al., 2018), a significant step towards ensuring that the insights and predictions generated can be effectively applied in realistic operational settings (Schaff et al., 2023). Future directions might also include integrating a critic architecture directly into the design learning process and extending the off-policy nature to the design part.

In conclusion, the findings and the comparison to traditional approaches, such as the design-only and best two-step scenarios, highlight that optimal design and optimal control are intrinsically linked. These insights affirm the value of integrated co-optimisation strategies over traditional, segregated approaches, especially in complex and dynamic settings like modern energy systems.

Acknowledgements

The authors would like to thank Prof. Gilles Louppe for providing access to the Alan clusters, which facilitated the experiments in this work. Adrien Bolland gratefully acknowledges the financial support of a research fellowship of the F.R.S.-FNRS.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

  • Ahmed et al. (2019) Ahmed, Z., Roux, N. L., Norouzi, M., and Schuurmans, D. Understanding the impact of entropy on policy optimization, June 2019. URL https://arxiv.org/abs/1811.11214.
  • Bao et al. (2019) Bao, Z., Chen, D., Wu, L., and Guo, X. Optimal inter- and intra-hour scheduling of islanded integrated-energy system considering linepack of gas pipelines. Energy, 171:326–340, March 2019. ISSN 0360-5442. doi: 10.1016/j.energy.2019.01.016. URL https://www.sciencedirect.com/science/article/pii/S0360544219300180.
  • Bellman (1957) Bellman, R. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957. ISSN 0095-9057. URL https://www.jstor.org/stable/24900506. Publisher: Indiana University Mathematics Department.
  • Bhatia et al. (2022) Bhatia, J. S., Jackson, H., Tian, Y., Xu, J., and Matusik, W. Evolution Gym: A Large-Scale Benchmark for Evolving Soft Robots, January 2022. URL https://arxiv.org/abs/2201.09863.
  • Bolland et al. (2022) Bolland, A., Boukas, I., Berger, M., and Ernst, D. Jointly learning environments and control policies with projected stochastic gradient ascent. Journal of Artificial Intelligence Research, 73:117–171, 2022. ISSN 1076-9757. doi: 10.1613/jair.1.13350. URL https://www.jair.org/index.php/jair/article/view/13350.
  • Cauz et al. (2023) Cauz, M., Bolland, A., Miftari, B., Perret, L., Ballif, C., and Wyrsch, N. Reinforcement learning for joint design and control of battery-PV systems. Proceedings of ECOS 2023, 2023. doi: 10.52202/069564-0281. 36th International Conference on Efficiency, Cost, Optimization, Simulation and Environmental Impact of Energy Systems.
  • Chen et al. (2020) Chen, T., He, Z., and Ciocarlie, M. Hardware as Policy: Mechanical and Computational Co-Optimization using Deep Reinforcement Learning, November 2020. URL https://arxiv.org/abs/2008.04460.
  • Clack et al. (2015) Clack, C. T. M., Xie, Y., and MacDonald, A. E. Linear programming techniques for developing an optimal electrical system including high-voltage direct-current transmission and storage. International Journal of Electrical Power & Energy Systems, 68:103–114, June 2015. ISSN 0142-0615. doi: 10.1016/j.ijepes.2014.12.049. URL https://www.sciencedirect.com/science/article/pii/S0142061514007765.
  • Daadaa et al. (2021) Daadaa, M., Séguin, S., Demeester, K., and Anjos, M. F. An optimization model to maximize energy generation in short-term hydropower unit commitment using efficiency points. International Journal of Electrical Power & Energy Systems, 125:106419, February 2021. ISSN 0142-0615. doi: 10.1016/j.ijepes.2020.106419. URL https://www.sciencedirect.com/science/article/pii/S0142061519342218.
  • Dranka et al. (2021) Dranka, G. G., Ferreira, P., and Vaz, A. I. F. A review of co-optimization approaches for operational and planning problems in the energy sector. Applied Energy, 304:117703, December 2021. ISSN 0306-2619. doi: 10.1016/j.apenergy.2021.117703. URL https://www.sciencedirect.com/science/article/pii/S0306261921010588.
  • Fazlollahi & Maréchal (2013) Fazlollahi, S. and Maréchal, F. Multi-objective, multi-period optimization of biomass conversion technologies using evolutionary algorithms and mixed integer linear programming (MILP). Applied Thermal Engineering, 50(2):1504–1513, 2013. ISSN 1359-4311. doi: 10.1016/j.applthermaleng.2011.11.035. URL https://www.sciencedirect.com/science/article/pii/S1359431111006636.
  • François-Lavet et al. (2018) François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., and Pineau, J. An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3):219–354, 2018. ISSN 1935-8237, 1935-8245. doi: 10.1561/2200000071. URL https://www.nowpublishers.com/article/Details/MAL-071.
  • Gjorgiev & Sansavini (2018) Gjorgiev, B. and Sansavini, G. Electrical power generation under policy constrained water-energy nexus. Applied Energy, 210:568–579, January 2018. ISSN 0306-2619. doi: 10.1016/j.apenergy.2017.09.011. URL https://www.sciencedirect.com/science/article/pii/S0306261917312977.
  • Ha (2019) Ha, D. Reinforcement Learning for Improving Agent Design. Artificial Life, 25(4):352–365, November 2019. ISSN 1064-5462. doi: 10.1162/artl_a_00301. URL https://doi.org/10.1162/artl_a_00301.
  • Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, pp.  1861–1870. PMLR, July 2018. URL https://proceedings.mlr.press/v80/haarnoja18b.html. ISSN: 2640-3498.
  • Jackson et al. (2021) Jackson, L., Walters, C., Eckersley, S., Senior, P., and Hadfield, S. Orchid: Optimisation of robotic control and hardware in design using reinforcement learning. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  4911–4917, 2021. doi: 10.1109/IROS51168.2021.9635865.
  • Jayadev et al. (2020) Jayadev, G., Leibowicz, B. D., and Kutanoglu, E. U.S. electricity infrastructure of the future: Generation and transmission pathways through 2050. Applied Energy, 260:114267, February 2020. ISSN 0306-2619. doi: 10.1016/j.apenergy.2019.114267. URL https://www.sciencedirect.com/science/article/pii/S0306261919319543.
  • Khojasteh (2020) Khojasteh, M. A robust energy procurement strategy for micro-grid operator with hydrogen-based energy resources using game theory. Sustainable Cities and Society, 60:102260, September 2020. ISSN 2210-6707. doi: 10.1016/j.scs.2020.102260. URL https://www.sciencedirect.com/science/article/pii/S2210670720304819.
  • Krishnan et al. (2016) Krishnan, V., Ho, J., Hobbs, B. F., Liu, A. L., McCalley, J. D., Shahidehpour, M., and Zheng, Q. P. Co-optimization of electricity transmission and generation resources for planning and policy analysis: review of concepts and modeling approaches. Energy Systems, 7(2):297–332, May 2016. ISSN 1868-3975. doi: 10.1007/s12667-015-0158-4. URL https://doi.org/10.1007/s12667-015-0158-4.
  • Li et al. (2018) Li, B., Roche, R., Paire, D., and Miraoui, A. Optimal sizing of distributed generation in gas/electricity/heat supply networks. Energy, 151:675–688, May 2018. ISSN 0360-5442. doi: 10.1016/j.energy.2018.03.080. URL https://www.sciencedirect.com/science/article/pii/S0360544218304894.
  • Lillicrap et al. (2019) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arxiv, (arXiv:1509.02971), 2019. doi: 10.48550/arXiv.1509.02971. URL https://arxiv.org/abs/1509.02971.
  • Luck et al. (2020) Luck, K. S., Amor, H. B., and Calandra, R. Data-efficient Co-Adaptation of Morphology and Behaviour with Deep Reinforcement Learning. In Proceedings of the Conference on Robot Learning, pp.  854–869. PMLR, May 2020. URL https://proceedings.mlr.press/v100/luck20a.html. ISSN: 2640-3498.
  • Peng et al. (2018) Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp.  3803–3810, May 2018. doi: 10.1109/ICRA.2018.8460528. URL https://arxiv.org/abs/1710.06537.
  • Perera & Kamalaruban (2021) Perera, A. and Kamalaruban, P. Applications of reinforcement learning in energy systems. Renewable and Sustainable Energy Reviews, 137:110618, 2021. doi: 10.1016/j.rser.2020.110618.
  • Perera et al. (2020) Perera, A. T. D., Wickramasinghe, P. U., Nik, V. M., and Scartezzini, J.-L. Introducing reinforcement learning to the energy system design process. Applied Energy, 262:114580, 2020. ISSN 0306-2619. doi: 10.1016/j.apenergy.2020.114580. URL https://www.sciencedirect.com/science/article/pii/S0306261920300921.
  • Popovici & Winston (2015) Popovici, E. and Winston, E. A framework for co-optimization algorithm performance and its application to worst-case optimization. Theoretical Computer Science, 567:46–73, February 2015. ISSN 0304-3975. doi: 10.1016/j.tcs.2014.10.038. URL https://www.sciencedirect.com/science/article/pii/S0304397514008305.
  • Qiu et al. (2017) Qiu, T., Xu, B., Wang, Y., Dvorkin, Y., and Kirschen, D. S. Stochastic Multistage Coplanning of Transmission Expansion and Energy Storage. IEEE Transactions on Power Systems, 32(1):643–651, January 2017. ISSN 1558-0679. doi: 10.1109/TPWRS.2016.2553678. URL https://ieeexplore.ieee.org/document/7454784. Conference Name: IEEE Transactions on Power Systems.
  • Quest et al. (2022) Quest, H., Cauz, M., Heymann, F., Rod, C., Perret, L., Ballif, C., Virtuani, A., and Wyrsch, N. A 3d indicator for guiding AI applications in the energy sector. Energy and AI, 9:100167, 2022. ISSN 2666-5468. doi: 10.1016/j.egyai.2022.100167. URL https://www.sciencedirect.com/science/article/pii/S2666546822000234.
  • Sachio et al. (2022) Sachio, S., Mowbray, M., Papathanasiou, M. M., del Rio-Chanona, E. A., and Petsagkourakis, P. Integrating process design and control using reinforcement learning. Chemical Engineering Research and Design, 183:160–169, 2022. ISSN 0263-8762. doi: 10.1016/j.cherd.2021.10.032. URL https://www.sciencedirect.com/science/article/pii/S0263876221004421.
  • Schaff et al. (2019) Schaff, C., Yunis, D., Chakrabarti, A., and Walter, M. R. Jointly learning to construct and control agents using deep reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pp.  9798–9805. IEEE Press, 2019. doi: 10.1109/ICRA.2019.8793537. URL https://doi.org/10.1109/ICRA.2019.8793537.
  • Schaff et al. (2023) Schaff, C., Sedal, A., Ni, S., and Walter, M. R. Sim-to-real transfer of co-optimized soft robot crawlers. Autonomous Robots, 47(8):1195–1211, December 2023. ISSN 1573-7527. doi: 10.1007/s10514-023-10130-8. URL https://doi.org/10.1007/s10514-023-10130-8.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms, August 2017. URL https://arxiv.org/abs/1707.06347.
  • Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning, pp.  387–395. PMLR, 2014. URL https://proceedings.mlr.press/v32/silver14.html.
  • Towers et al. (2023) Towers, M., Terry, J. K., Kwiatkowski, A., Balis, J. U., Cola, G. d., Deleu, T., Goulão, M., Kallinteris, A., KG, A., Krimmel, M., Perez-Vicente, R., Pierré, A., Schulhoff, S., Tai, J. J., Shen, A. T. J., and Younis, O. G. Gymnasium, March 2023. URL https://zenodo.org/record/8127025.
  • Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992. ISSN 1573-0565. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696.

Appendix A Building-Scale System – Environment definition

This appendix details the building-scale energy system used within an office setting, equipped with a PV (Photovoltaic) installation and a stationary lithium-ion battery to satisfy its electricity requirements. The system also features a bidirectional EV (Electric Vehicle) charging point, whose usage is stochastically modelled based on typical patterns. Moreover, the building is connected to the electrical grid, subject to dynamically varying electricity prices. The main objective is to determine the optimal design for the PV installation ($P^{\textsc{nom}}$) and the battery capacity ($B$), while simultaneously developing an optimal control policy for battery and EV management. This aims to minimise the total cost of ownership, encompassing both capital and operational expenses, as well as grid costs. The environment is formulated below as an MDP and Table 3 gathers all parameters of this environment.

| Group | Parameter | Value | Set | Unit | Description |
|---|---|---|---|---|---|
| Grid | $P^{\textsc{imp}}$ | | $\mathbb{R}_{+}^{T}$ | kW | imported power (from the grid) |
| | $P^{\textsc{exp}}$ | | $\mathbb{R}_{+}^{T}$ | kW | exported power (to the grid) |
| | $C^{\textsc{imp}}_{\textsc{grid}}$ | | $\mathbb{R}^{T}$ | CHF/kWh | imported electricity price |
| | $C^{\textsc{exp}}_{\textsc{grid}}$ | | $\mathbb{R}^{T}$ | CHF/kWh | exported electricity price |
| | $C_{\textsc{grid}}$ | | $\mathbb{R}^{T}$ | CHF | total electricity grid cost |
| PV | $P^{\textsc{nom}}$ | | $\mathbb{R}_{+}$ | kWp | nominal power of the PV installation |
| | $P^{\textsc{nom}}_{\textsc{min}}$ | 0 | $\mathbb{R}_{+}$ | kWp | minimal nominal PV power |
| | $P^{\textsc{nom}}_{\textsc{max}}$ | $\infty$ | $\mathbb{R}_{+}$ | kWp | maximal nominal PV power |
| | $P^{\textsc{prod}}$ | | $\mathbb{R}_{+}^{T}$ | kW | generated PV power |
| | $p^{\textsc{prod}}$ | | $\mathbb{R}_{+}^{T}$ | kW | normalised PV power |
| | $L^{\textsc{pv}}$ | 20 | $\mathbb{N}$ | years | PV lifetime |
| | $\textsc{R}_{\textsc{pv}}$ | | $\mathbb{R}_{+}$ | - | annuity factor |
| | $\textsc{ox}_{\textsc{pv}}^{\textsc{fix}}$ | 0 | $\mathbb{R}_{+}$ | CHF | opex PV fixed cost |
| | $\textsc{ox}_{\textsc{pv}}^{\textsc{var}}$ | 100 | $\mathbb{R}_{+}$ | CHF/kW | opex PV variable cost |
| | $\textsc{cx}_{\textsc{pv}}^{\textsc{fix}}$ | 100 | $\mathbb{R}_{+}$ | CHF | capex PV fixed cost |
| | $\textsc{cx}_{\textsc{pv}}^{\textsc{var}}$ | 775 | $\mathbb{R}_{+}$ | CHF/kW | capex PV variable cost |
| Battery | $B$ | | $\mathbb{R}_{+}$ | kWh | nominal capacity of the battery |
| | $\textsc{soc}$ | | $\mathbb{R}_{+}^{T}$ | kWh | state of charge of the battery |
| | $P^{B}$ | | $\mathbb{R}^{T}$ | kW | power exchanged with the battery |
| | $B_{\textsc{min}}$ | 0 | $\mathbb{R}_{+}$ | kWh | minimal nominal battery capacity |
| | $B_{\textsc{max}}$ | $\infty$ | $\mathbb{R}_{+}$ | kWh | maximal nominal battery capacity |
| | $\eta^{\textsc{b}}$ | 0.9 | $\left]0,1\right]$ | - | battery efficiency |
| | $L^{\textsc{b}}$ | 10 | $\mathbb{N}$ | years | battery lifetime |
| | $\textsc{R}_{\textsc{B}}$ | | $\mathbb{R}_{+}$ | - | annuity factor |
| | $\textsc{ox}_{\textsc{b}}^{\textsc{fix}}$ | 0 | $\mathbb{R}_{+}$ | CHF | opex battery fixed cost |
| | $\textsc{ox}_{\textsc{b}}^{\textsc{var}}$ | 10 | $\mathbb{R}_{+}$ | CHF/kW | opex battery variable cost |
| | $\textsc{cx}_{\textsc{b}}^{\textsc{fix}}$ | 50 | $\mathbb{R}_{+}$ | CHF | capex battery fixed cost |
| | $\textsc{cx}_{\textsc{b}}^{\textsc{var}}$ | 300 | $\mathbb{R}_{+}$ | CHF/kW | capex battery variable cost |
| EV | $b^{\textsc{ev}}$ | | $\mathbb{R}_{+}^{T}$ | - | binary indicator of EV presence |
| | $B^{\textsc{ev}}$ | 80 | $\mathbb{R}_{+}$ | kWh | maximal nominal EV battery capacity |
| | $\textsc{soc}^{\textsc{ev}}$ | | $\mathbb{R}_{+}^{T}$ | kWh | state of charge of the EV battery |
| | $\textsc{soc}^{\textsc{ev}}_{\textsc{min}}$ | 32 | $\mathbb{R}_{+}$ | kWh | minimum state of charge of the EV battery |
| | $P^{\textsc{ev}}$ | | $\mathbb{R}_{+}^{T}$ | kW | power exchange with the EV battery |
| | $P^{\textsc{ev}}_{\textsc{max}}$ | 5 | $\mathbb{R}_{+}$ | kW | maximal power exchange with the EV battery |
| | $\eta^{\textsc{ev}}$ | 1 | $\left]0,1\right]$ | - | EV battery efficiency |
| | $C^{\textsc{imp}}_{\textsc{ev}}$ | -1.5 | $\mathbb{R}$ | CHF/kWh | imported electricity price from the EV battery |
| | $C^{\textsc{exp}}_{\textsc{ev}}$ | 1 | $\mathbb{R}$ | CHF/kWh | exported electricity price to the EV battery |
| System | $T$ | | $\mathbb{N}$ | - | time horizon |
| | $\Delta t$ | 1 | $\mathbb{R}_{+}$ | h | time steps |
| | $r$ | 0.05 | $\mathbb{R}$ | - | discount rate |
| | $P^{\textsc{load}}$ | | $\mathbb{R}_{+}^{T}$ | kW | uncontrollable electricity consumption |
Table 3: Set of constants and parameters of the building-scale PV-battery system studied.

The State Space of the system can be fully described by

$$s_t = (h_t, d_t, \textsc{soc}_t, P^{\textsc{prod}}_t, P^{\textsc{load}}_t, C^{\textsc{imp}}_{\textsc{grid},t}, C^{\textsc{exp}}_{\textsc{grid},t}, b^{\textsc{ev}}_t, \textsc{soc}^{\textsc{ev}}_t) \in \mathcal{S} \tag{2}$$
  • $h_t \in \{0, \dots, 23\}$ denotes the hour of the day at time $t$.

  • $d_t \in \{0, \dots, 364\}$ denotes the day of the year at time $t$.

  • $\textsc{soc}_t \in [0, B]$ is the state of charge of the battery at time $t$. This value is upper bounded by the nominal capacity $B$ of the installed battery.

  • $P^{\textsc{prod}}_t \in \mathbb{R}_{+}$ represents the expected PV power at time $t$. This value is obtained by taking the normalised historical data $p^{\textsc{prod}}_t$ for the hour $h_t$ and day $d_t$ and scaling it by the designed nominal PV power ($P^{\textsc{nom}}$).

  • $P^{\textsc{load}}_t \in \mathbb{R}_{+}$ denotes the expected value of the electrical load at time $t$. The load profile is determined using historical data that corresponds to the same hour and day as the PV power.

  • $C^{\textsc{imp}}_{\textsc{grid},t} \in \mathbb{R}$ represents the cost per unit of electricity imported from the grid at time $t$. This value is dynamically determined from a predefined dataset.

  • $C^{\textsc{exp}}_{\textsc{grid},t} \in \mathbb{R}$ corresponds to the compensation received per unit of electricity exported to the grid at time $t$. Like the import costs, this value is derived from a dataset.

  • $b^{\textsc{ev}}_t \in \{0, 1\}$ is a binary indicator of whether a bidirectional EV is present at the charging station at time $t$. This state affects the potential for energy storage or retrieval from the EV’s battery, thereby influencing the overall energy management strategy. The value is updated according to usage patterns captured in the dataset.

  • $\textsc{soc}^{\textsc{ev}}_t \in [\textsc{soc}^{\textsc{ev}}_{\textsc{min}}, B^{\textsc{ev}}]$ specifies the current charge level of the EV’s battery, when present. This value ranges between 40 % of $B^{\textsc{ev}}$ and $B^{\textsc{ev}}$ when the EV is connected, and is set to zero when no EV is present. The charge level is initialised randomly based on probable starting conditions and adjusted according to actual charging and discharging activities dictated by the control policy and EV usage scenarios from the dataset.
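To make the structure of $s_t$ concrete, the following minimal Python sketch stores the nine components of Eq. (2) in a named tuple and checks the bounds listed above. The class name `State`, the field names and the helper `in_state_space` are illustrative assumptions and are not taken from the original implementation; the EV bounds default to the values in Table 3.

```python
from typing import NamedTuple


class State(NamedTuple):
    """One MDP state s_t, in the order of Eq. (2). Field names are illustrative."""
    hour: int          # h_t in {0, ..., 23}
    day: int           # d_t in {0, ..., 364}
    soc: float         # stationary battery state of charge, kWh
    p_prod: float      # expected PV power, kW
    p_load: float      # expected electrical load, kW
    c_imp: float       # grid import price, CHF/kWh
    c_exp: float       # grid export price, CHF/kWh
    ev_present: int    # b_t^ev, 1 if an EV is plugged in, else 0
    soc_ev: float      # EV state of charge, kWh (0 when no EV is present)


def in_state_space(s: State, capacity: float,
                   ev_capacity: float = 80.0, ev_soc_min: float = 32.0) -> bool:
    """Return True if s respects the bounds given in the bullet list."""
    if s.ev_present == 1:
        ev_ok = ev_soc_min <= s.soc_ev <= ev_capacity
    else:
        ev_ok = s.ev_present == 0 and s.soc_ev == 0.0
    return (0 <= s.hour <= 23 and 0 <= s.day <= 364
            and 0.0 <= s.soc <= capacity
            and s.p_prod >= 0.0 and s.p_load >= 0.0
            and ev_ok)
```

The stationary battery capacity is passed as an argument rather than fixed, since $B$ is a design variable in the co-optimisation.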

The Action Space comprises the power exchanged with the stationary battery and the EV’s battery when present. Positive values indicate discharging, and negative values represent charging. The continuous action space is defined as:

$$a_t = (\widetilde{P}_t^{B}, \widetilde{P}_t^{\textsc{ev}}) \in \mathcal{A} = \left[-\frac{B}{\Delta t}, \frac{B}{\Delta t}\right] \times \left[-P^{\textsc{ev}}_{\textsc{max}}, P^{\textsc{ev}}_{\textsc{max}}\right] \tag{3}$$

The Initial Distribution sets the initial state as follows. The hour $h_0$ is set to 0. During the training process, the initial day $d_0$ is randomly selected, whereas for the validation process, $d_0$ is set to the earliest date within the year. The initial state of charge $\textsc{soc}_0$ is randomly determined during training and set to half of the battery capacity $B$ during validation. All other initial state values are derived from a predefined input dataset based on the corresponding initial hour and day.

The Transition Probability becomes a transition function, as there is no randomness involved. This function updates the system state at each hourly time step.

The hour of the day $h_t$ increments each hour, and the day $d_t$ increments every 24 hours:

\begin{align}
h_{t+1} &= (h_t + 1) \bmod 24 \tag{4} \\
d_{t+1} &= \text{Int}\left(\frac{h_t + 1}{24}\right) \tag{5}
\end{align}

where the function $\text{Int}$ takes the integer value of the expression.
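The calendar update can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function name `step_time` is hypothetical, the integer part in Eq. (5) is read as a carry that is added to the current day $d_t$, and the day is assumed to wrap around after day 364.

```python
from typing import Tuple


def step_time(hour: int, day: int) -> Tuple[int, int]:
    """Advance the calendar part of the state by one hourly step."""
    next_hour = (hour + 1) % 24      # Eq. (4)
    carry = (hour + 1) // 24         # Int((h_t + 1) / 24): 1 at midnight, else 0
    next_day = (day + carry) % 365   # assumption: carry is added to d_t and wraps
    return next_hour, next_day
```

For example, `step_time(23, 10)` moves from the last hour of day 10 to hour 0 of day 11, while any other hour leaves the day unchanged.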

The states of charge of the stationary battery, $\textsc{soc}_t$, and of the EV’s battery, $\textsc{soc}^{\textsc{ev}}_t$, are updated based on the respective power actions $\widetilde{P}_t^{B}$ and $\widetilde{P}_t^{\textsc{ev}}$. These actions specify the power to be charged to or discharged from the batteries over one hour ($\Delta t = 1\,\text{h}$). However, the actual power exchanged is constrained either by the battery capacity when charging or by the energy stored in the battery when discharging.

$$P^{B}_t = \begin{cases} \frac{B - \textsc{soc}_t}{\Delta t} & \text{if } \widetilde{P}^{B}_t > \frac{B - \textsc{soc}_t}{\Delta t} \\ -\frac{\textsc{soc}_t}{\Delta t} & \text{if } \widetilde{P}^{B}_t < -\frac{\textsc{soc}_t}{\Delta t} \\ \widetilde{P}^{B}_t & \text{otherwise} \end{cases} \tag{6}$$

Similarly for the EV’s battery:

$$P^{\textsc{ev}}_t = \begin{cases} \frac{B^{\textsc{ev}} - \textsc{soc}^{\textsc{ev}}_t}{\Delta t} & \text{if } \widetilde{P}^{\textsc{ev}}_t > \frac{B^{\textsc{ev}} - \textsc{soc}^{\textsc{ev}}_t}{\Delta t} \\ \frac{\textsc{soc}^{\textsc{ev}}_t}{\Delta t} & \text{if } \widetilde{P}^{\textsc{ev}}_t < -\frac{40\% \cdot B^{\textsc{ev}}_t}{\Delta t} \\ \widetilde{P}^{\textsc{ev}}_t & \text{otherwise} \end{cases} \tag{7}$$
$$\text{with } -P^{\textsc{ev}}_{\textsc{max}} \leq \widetilde{P}_t^{\textsc{ev}} \leq P^{\textsc{ev}}_{\textsc{max}} \tag{8}$$

Using these power exchanges, the state of charge for the next time step is calculated as:

\begin{align}
\textsc{soc}_{t+1} &= \textsc{soc}_t + P^{B}_t \cdot \Delta t \cdot \left(\eta^{\textsc{b}} \text{ if } P^{B}_t \geq 0 \text{ else } \frac{1}{\eta^{\textsc{b}}}\right) \tag{9} \\
\textsc{soc}^{\textsc{ev}}_{t+1} &= \textsc{soc}^{\textsc{ev}}_t + P^{\textsc{ev}}_t \cdot \Delta t \cdot \left(\eta^{\textsc{ev}} \text{ if } P^{\textsc{ev}}_t \geq 0 \text{ else } \frac{1}{\eta^{\textsc{ev}}}\right) \tag{10}
\end{align}

where $\eta^{\textsc{b}}$ and $\eta^{\textsc{ev}}$ are the efficiencies of the stationary battery and the EV’s battery, respectively.
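The clipping and update steps can be illustrated with a short Python sketch. It is not the authors' implementation and rests on several assumptions: the function names `clip_power` and `next_soc` are hypothetical; following Eqs. (9) and (10), a positive power is treated as one that raises the state of charge; and the discharge limit is taken as the energy stored above a floor `soc_min` (0 for the stationary battery, 32 kWh for the EV according to Table 3), which is only one possible reading of the lower branches of Eqs. (6) and (7).

```python
def clip_power(requested: float, soc: float, capacity: float,
               soc_min: float = 0.0, dt: float = 1.0) -> float:
    """Limit a requested power (kW) to what the battery can absorb or deliver."""
    max_charge = (capacity - soc) / dt       # room left before the battery is full
    max_discharge = (soc - soc_min) / dt     # energy available above the floor (assumed)
    return max(-max_discharge, min(requested, max_charge))


def next_soc(soc: float, power: float, eta: float, dt: float = 1.0) -> float:
    """Update of Eqs. (9)-(10): factor eta when power >= 0, 1/eta otherwise."""
    factor = eta if power >= 0 else 1.0 / eta
    return soc + power * dt * factor


# Stationary battery with B = 10 kWh holding 8 kWh: a 20 kW request is cut to 2 kW.
p_actual = clip_power(20.0, soc=8.0, capacity=10.0)
soc_after = next_soc(8.0, p_actual, eta=0.9)
```

Because the negative branch of the update divides by the efficiency, a discharge clipped in this way can still take the state of charge slightly below the floor when $\eta < 1$; the text does not specify how this case is handled.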

The Reward Function quantifies the system’s performance by incorporating economic factors that include investment cost (capex), operating cost (opex), and costs associated with the purchase and sale of electricity from the grid. The reward at each time step $t$ is calculated as the negative total expenditure (totex):

\begin{align}
r_t &= -\textsc{totex}_t \tag{11} \\
&= -(\textsc{capex} + \textsc{opex} + C_{\textsc{grid},t}) \tag{12} \\
&= -(\textsc{capex} + \textsc{opex} + P^{\textsc{imp}}_t \cdot C^{\textsc{imp}}_{\textsc{grid},t} - P^{\textsc{exp}}_t \cdot C^{\textsc{exp}}_{\textsc{grid},t}) \tag{13}
\end{align}

where $C_{\textsc{grid},t}$ represents the net cost of electricity exchanged with the grid at time $t$.

The total cost (totex) includes:

$$\textsc{totex} = \textsc{opex} + \textsc{capex} + C_{\textsc{grid}} \tag{14}$$

Operating costs (opex) and capital expenditure (capex) are defined for both PV and battery design parameters as:

\begin{align}
\textsc{opex} &= \textsc{ox}_{\textsc{pv}} + \textsc{ox}_{\textsc{B}} \tag{15} \\
\textsc{capex} &= \textsc{cx}_{\textsc{pv}} \cdot R_{\textsc{pv}} + \textsc{cx}_{\textsc{B}} \cdot R_{B} \tag{16}
\end{align}

where $R_{\textsc{pv}}$ and $R_{B}$ are annuity factors adjusting the capex for the lifetime of the system’s components, considering their financial amortisation over a finite period $T$.

The annuity factor $R$ is derived as follows to prorate the capex over the operational duration $T$, where $T$ is expressed in hours and 8760 is the number of hours in a year:

$$R = \frac{r \cdot (1+r)^{L}}{(1+r)^{L} - 1} \cdot \frac{T}{8760} \tag{17}$$

This factor is calculated using the annual discount rate $r$ and the expected lifetime $L$ of the components, thereby aligning the investment costs proportionally to the duration $T$ of the optimisation horizon.
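As a worked illustration of Eq. (17), the sketch below evaluates the annuity factor for the battery using $r = 0.05$ and $L^{\textsc{b}} = 10$ years from Table 3, prorated to a one-week horizon of $T = 168$ hours. The function name `annuity_factor` is an assumption made for this example and does not come from the original code.

```python
def annuity_factor(rate: float, lifetime_years: int, horizon_hours: float) -> float:
    """Eq. (17): yearly capital recovery factor prorated to the horizon T."""
    growth = (1.0 + rate) ** lifetime_years
    yearly = rate * growth / (growth - 1.0)   # share of the capex repaid per year
    return yearly * horizon_hours / 8760.0    # scale from one year to T hours


# Battery values from Table 3 and a one-week episode.
r_battery_week = annuity_factor(rate=0.05, lifetime_years=10, horizon_hours=168)
r_battery_year = annuity_factor(rate=0.05, lifetime_years=10, horizon_hours=8760)
```

With these inputs the yearly factor is close to 0.13 and the weekly factor close to 0.0025, so roughly 0.25 % of the battery investment is attributed to a single one-week episode.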

Table 4: Synthetic dataset of electricity pricing and EV arrival time of the building-scale PV-battery system studied.
h $C^{\textsc{exp}}_{\textsc{grid}}$ $C^{\textsc{imp}}_{\textsc{grid}}$ Probability of EV arrival time
[-] [CHF/kWh] [CHF/kWh] [-]
0 0 -0.3 0
1 0 -0.3 0
2 0 -0.3 0
3 0 -0.3 0
4 0 -0.3 0
5 0 -0.3 0
6 0 -0.5 0
7 0 -0.5 0.75
8 0 -0.5 0.9
9 0 -0.5 0.9
10 0 -0.3 0.75
11 0 -0.3 0.1
12 0 -0.3 0.1
13 0 -0.3 0.1
14 0 -0.3 0
15 0 -0.3 0
16 0 -0.5 0
17 0 -0.5 0
18 0 -0.5 0
19 0 -0.5 0
20 0 -0.5 0
21 0 -0.5 0
22 0 -0.3 0
23 0 -0.3 0
Figure 4: Visualisation of the historical dataset covering a year of the building electricity consumption and its normalised PV production. The white background indicates the training set, while the grey background represents the validation dataset.

The Optimisation Horizon refers to the period over which the system is optimised, corresponding to the duration of an episode. In this model, each time step of the MDP represents a single hour, and the horizon is truncated after $T = 168$ hours, equivalent to one week. Long-term dependencies are captured through the bootstrapping method used to train the critic. However, the time horizon would ideally span an entire year, or even the full lifecycle of the system, to capture seasonal fluctuations in production and consumption, as well as potential equipment degradation. To assess performance over such a longer time horizon, the performance is regularly evaluated during the training phase across the full training dataset, corresponding to $T = 8088$ hours.

The Historical Datasets of the system are detailed in Table 3. The historical data for the normalised PV production and the electrical consumption are derived from real monitoring of an office building in Switzerland in 2021, as shown in Figure 4. This dataset is divided into training and validation parts, each selected to represent the seasonal fluctuations. The datasets for the prices of electricity imported from and exported to the grid, as well as for the arrival times of the EV, are synthetically generated and summarised in Table 4. The grid export cost, $C^{\textsc{exp}}_{\textsc{grid}}$, is set to 0 at all times to discourage making money by reselling PV production and to maximise self-consumption. The duration of the EV’s presence is randomly varied between 5 and 8 hours, and the initial state of charge (SoC) of the EV is randomly set between 40 % and 100 % of its battery capacity, $B^{\textsc{ev}}$.
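The synthetic EV scenario can be sketched as follows. This is a hedged illustration, not the authors' generator: the function name `sample_ev_day` is hypothetical, and the sampling scheme is an assumption, since the text gives only the hourly arrival probabilities of Table 4, the 5 to 8 hour stay, and the 40 % to 100 % initial SoC range. Here the probabilities are treated as independent hourly Bernoulli draws, at most one arrival per day is allowed, and the stay is drawn as a whole number of hours.

```python
import random

# Non-zero hourly arrival probabilities from Table 4 (all other hours are 0).
ARRIVAL_PROB = {7: 0.75, 8: 0.9, 9: 0.9, 10: 0.75, 11: 0.1, 12: 0.1, 13: 0.1}


def sample_ev_day(rng: random.Random, ev_capacity: float = 80.0):
    """Return (presence, initial_soc) for one day; initial_soc is None if no EV comes."""
    presence = [0] * 24
    initial_soc = None
    for hour in range(24):
        if rng.random() < ARRIVAL_PROB.get(hour, 0.0):
            stay = rng.randint(5, 8)                     # hours present, inclusive bounds
            for h in range(hour, min(hour + stay, 24)):
                presence[h] = 1
            # Initial SoC between 40 % and 100 % of the EV battery capacity.
            initial_soc = rng.uniform(0.4 * ev_capacity, ev_capacity)
            break                                        # assumption: one arrival per day
    return presence, initial_soc


presence, soc0 = sample_ev_day(random.Random(0))
```

Passing an explicit `random.Random` instance keeps the scenario reproducible, which is convenient when training and validation episodes need to be regenerated.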