License: CC BY 4.0
arXiv:2312.06527v1 [cs.AI] 11 Dec 2023

Can Reinforcement Learning support policy makers? A preliminary study with Integrated Assessment Models

Theodore Wolf
University College London | Carbon Re
[email protected]
Nantas Nardelli
Carbon Re
[email protected]
John Shawe-Taylor
University College London
[email protected]
María Pérez Ortiz
University College London
[email protected]
Abstract

Governments around the world aspire to ground decision-making in evidence. Many of the foundations of policy making (e.g. sensing patterns that relate to societal needs, developing evidence-based programs, forecasting potential outcomes of policy changes, and monitoring the effectiveness of policy programs) have the potential to benefit from the use of large-scale datasets or simulations together with intelligent algorithms. These could, if designed and deployed in a way that is well grounded in scientific evidence, enable a more comprehensive, faster, and more rigorous approach to policy making. Integrated Assessment Models (IAMs) are a broad class of scientific models that attempt to link the main features of society and the economy with the biosphere in one modelling framework. At present, these systems are probed by policy makers and advisory groups in a hypothesis-driven manner. In this paper, we empirically demonstrate that modern Reinforcement Learning can be used to probe IAMs and explore the space of solutions in a more principled manner. While the implications of our results are modest since the environment is simplistic, we believe that this is a stepping stone towards more ambitious use cases, which could allow for effective exploration of policies and understanding of their consequences and limitations.

1 Introduction

Climate is a high-dimensional dynamical system with strongly interdependent components and long time dependencies, all of which interact to produce highly non-linear responses and behavior. Climate is also highly conditioned on human behavior, another greatly complex system, such that it is now necessary to reason about climate change from a socio-climatic perspective [Moore et al., 2022]. To make progress in the face of extreme consequences, policymakers and advisory groups employ Integrated Assessment Models (IAMs), state-of-the-art models for climate change that combine knowledge about human development (such as economic theories) with planetary sciences such as ecology and geophysics [Parson and Fisher-Vanden, 1997]. Exploring and analysing the properties of the IAMs employed for large-scale assessments [Pörtner et al., 2022], e.g. to measure fidelity against the real world, is generally intractable from a computational perspective, which leads researchers to adopt poor simplifying assumptions that decrease the models' effectiveness [Asefi-Najafabady et al., 2021]. Smaller IAMs aim to provide an alternative by employing fewer state variables and simpler sets of dynamics, making them amenable to mathematical probing and analysis [Kittel et al., 2017, Nitzbon et al., 2017]. The literature commonly explores these models with ODE solvers; however, Strnad et al. [2019] recently showed that it is possible to employ them as environments in standard Reinforcement Learning (RL) [Sutton and Barto, 2020] and to explore the models using trained policies. These policies can be used to understand the system and to provide insight towards improving more complex IAMs, or even our understanding of climate change policies.

We build on this work, testing more RL algorithms and reward functions, as well as different experimental setups. Among others, we show that (a) modern RL can learn effective policies with a variety of reward functions in this environment; (b) different agents and reward functions generate a significantly diverse set of solutions, thus exploring the IAM in different manners; (c) care is needed when designing reward functions, as they show different success rates in reaching the desirable state for different initialisation points; and (d) RL helps us gain a deeper understanding of the properties and limitations of the applied models.

2 AYS environment, RL reward functions and agents

We employ the AYS model [Kittel et al., 2017], which we use to create a Markov Decision Process following Strnad et al. [2019]. AYS is governed by three coupled differential equations:

$$\frac{dA}{dt}=\frac{1}{1+(S/\sigma)^{\rho}}\frac{Y}{\phi\epsilon}-A/\tau_{A},\quad \frac{dY}{dt}=\beta Y-\theta AY,\quad \frac{dS}{dt}=\left(1-\frac{1}{1+(S/\sigma)^{\rho}}\right)\frac{Y}{\epsilon}-S/\tau_{S}, \tag{1}$$

where $A$ is the excess atmospheric carbon, $Y$ society’s economic output, and $S$ the renewable knowledge stock. At each step $t$, the agent observes a vector $(A_{t},Y_{t},S_{t})$, which is a partial representation of the state of the system. This dynamical system contains two attractors: a green fixed point, where renewable knowledge and economic output grow forever, and a black fixed point, where the economic output is stagnant and there is a large amount of excess carbon in the atmosphere (Figure 1). Broadly speaking, the former is a positive end state (and effectively encodes the high-level goal), while the latter is undesirable. See Appendix A for more technical details of the environment. The agent can take four actions, each corresponding to a high-level government policy decision that changes the dynamics of the environment (see Table 1 for a mathematical and textual description of the actions).
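To make the dynamics concrete, the following is a minimal sketch of Equation (1) in plain Python, using the parameter values listed in Appendix A and the parameter changes of Table 1. The fixed-step Runge-Kutta integrator, the one-year step size, the function names (`ays_rhs`, `apply_action`, `rk4_step`) and the dictionary layout are illustrative assumptions rather than the implementation used in the experiments.

```python
import math

# Default AYS parameters (Appendix A, Table 2); beta is 3 %/year written as a fraction.
DEFAULT_PARAMS = {
    "tau_A": 50.0,     # years
    "tau_S": 50.0,     # years
    "beta": 0.03,      # 1/year
    "sigma": 4e12,     # GJ
    "phi": 4.7e10,     # GJ/GtC
    "epsilon": 147.0,  # $/GJ
    "theta": 8.57e-5,
    "rho": 2.0,
}


def ays_rhs(state, p):
    """Right-hand side of Equation (1): returns (dA/dt, dY/dt, dS/dt)."""
    A, Y, S = state
    gamma = 1.0 / (1.0 + (S / p["sigma"]) ** p["rho"])  # fossil share of energy
    dA = gamma * Y / (p["phi"] * p["epsilon"]) - A / p["tau_A"]
    dY = p["beta"] * Y - p["theta"] * A * Y
    dS = (1.0 - gamma) * Y / p["epsilon"] - S / p["tau_S"]
    return (dA, dY, dS)


def apply_action(action, p=DEFAULT_PARAMS):
    """Return a copy of the parameters modified as described in Table 1."""
    q = dict(p)
    if action in ("ET", "ET+DG"):
        q["sigma"] = p["sigma"] / math.sqrt(p["rho"])
    if action in ("DG", "ET+DG"):
        q["beta"] = p["beta"] / 2.0
    return q


def rk4_step(state, p, dt=1.0):
    """Advance the state by dt years with one classical Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = ays_rhs(state, p)
    k2 = ays_rhs(shifted(state, k1, dt / 2), p)
    k3 = ays_rhs(shifted(state, k2, dt / 2), p)
    k4 = ays_rhs(shifted(state, k3, dt), p)
    return tuple(
        s + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


# Initial state from Appendix A.3: (A in GtC, Y in $/yr, S in GJ).
initial_state = (240.0, 7e13, 5e11)
next_state = rk4_step(initial_state, apply_action("noop"))
```

Returning a modified copy of the parameters keeps the default dynamics untouched, so a different action can be chosen at every step without having to undo the previous one.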

Reward functions

The reward signal that agents receive from taking actions in this environment is based on planetary boundaries (PBs), quantitative physical and economic limits which, if crossed, would lead to disastrous and irreversible consequences for the biosphere and humans [Rockström et al., 2009, Steffen et al., 2015].

Figure 1: Phase space of the environment; each dimension corresponds to one of the state dimensions. Hairlines show the flows of the system.

We define these boundaries as $A_{PB}=600$, $Y_{PB}=4\times 10^{13}$, and $S_{PB}=0$, which form the triplet $s_{PB}$. Crossing one of these boundaries yields a reward of 0 and ends the episode. Furthermore, in the standard scenario, the agent is rewarded for staying away from these boundaries, i.e. $R_{t}^{PB}=||s_{t}-s_{PB}||^{2}$. This incentivises the agent to maximise economic output while minimising carbon emissions. We call this reward function the PB reward. A second reward function we employ is the Policy Cost (PC) reward, which adds an action-dependent cost to the PB reward, simulating the real-world cost of implementing and maintaining any significant shift in policy in a running socio-economic system. For example, throttling growth over years may be challenging for a policymaker [Keyßer and Lenzen, 2021]. Finally, our experiments use a third and simpler reward function (the Sparse reward, so called because it significantly lowers the amount of feedback that the agent receives on average), which only considers whether the agent reaches the goal or hits any of the planetary boundaries:

$$r(s)=\begin{cases}1 & \text{if } s_{t}=s_{g},\\ -1 & \text{if } (A_{t}>A_{PB})\vee(Y_{t}<Y_{PB}),\\ 0 & \text{otherwise}.\end{cases} \tag{2}$$
Table 1: Environment actions, and how they relate to the policy cost (PC) reward function.
| Action | Parameter change | Explanation | PC reward |
| --- | --- | --- | --- |
| Noop | None | Environment evolves with default parameters. | $R_{t}^{PB}$ |
| ET | $\sigma\leftarrow\sigma/\sqrt{\rho}$ | Halves relative cost of renewables to fossil fuels. | $0.5\times R_{t}^{PB}$ |
| DG | $\beta\leftarrow\beta/2$ | Halves rate of growth of the economy. | $0.5\times R_{t}^{PB}$ |
| ET+DG | $\beta\leftarrow\beta/2$, $\sigma\leftarrow\sigma/\sqrt{\rho}$ | Combination of the two above actions. | $0.25\times R_{t}^{PB}$ |
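The three reward functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the state is scaled before the distance is taken, so raw values are used here (which lets $Y$ dominate the norm); the boundary test for the PB reward reuses the condition of Equation (2); and the `at_goal` flag is a hypothetical stand-in for the check $s_{t}=s_{g}$, whose tolerance is not specified in this section.

```python
# Planetary boundaries from the text: (A_PB, Y_PB, S_PB).
S_PB_STATE = (600.0, 4e13, 0.0)

# Multiplicative factors applied to the PB reward under the PC reward (Table 1).
PC_FACTOR = {"noop": 1.0, "ET": 0.5, "DG": 0.5, "ET+DG": 0.25}


def crossed_boundary(state):
    """Boundary test taken from Equation (2): too much carbon or too little output."""
    A, Y, _S = state
    return A > S_PB_STATE[0] or Y < S_PB_STATE[1]


def pb_reward(state):
    """Squared distance to the boundary triplet, or 0 once a boundary is crossed."""
    if crossed_boundary(state):
        return 0.0
    return sum((x - b) ** 2 for x, b in zip(state, S_PB_STATE))


def pc_reward(state, action):
    """PB reward scaled by the action-dependent factor of Table 1."""
    return PC_FACTOR[action] * pb_reward(state)


def sparse_reward(state, at_goal):
    """Equation (2); `at_goal` is a caller-supplied flag (hypothetical) for s_t = s_g."""
    if at_goal:
        return 1.0
    if crossed_boundary(state):
        return -1.0
    return 0.0
```

Expressing the policy cost as a lookup table of factors makes the trade-off explicit: any action other than noop at least halves the reward collected at that step.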

Agent settings

We are primarily interested in the interplay of RL and AYS. To study it, we train a diverse set of agents with four different learning algorithms: A2C [Mnih et al., 2016], PPO [Schulman et al., 2017], DQN [Mnih et al., 2015], and Double DQN with dueling networks and prioritized experience replay (D3QN) [Hessel et al., 2017]. All agents employ the same network architecture, bar the final layer, which depends on the number and type of heads required by the algorithm. See Appendix B for further details on the network architecture. Agents are trained for 5e5 steps and tuned separately for best cumulative performance using Bayesian optimization. All experiments report results over 3 random seeds.

3 Results and discussion

Figure 2 shows that most agents are able to learn a policy that optimizes the relevant reward function. Interestingly, while DQN and D3QN are broadly consistent across reward functions, A2C struggles to learn policies under the same episodic budget, even after a substantial amount of hyperparameter tuning. PPO, on the other hand, beats all agents on the PB reward but underperforms everywhere else.

Figure 2: Average cumulative training rewards for each agent by reward function.

We hypothesize that these differences may be related to the use of experience replay in the DQN agents, as well as to the action sampling method in the case of A2C and PPO. We will later see that a robust class of successful policies in this environment corresponds to getting as quickly as possible to a spot under a green "current" and letting the system converge to the goal. In such a scenario, the action distribution for large subsets of the state space needs to converge onto specific actions (such as noop in this particular case). This is a straightforward outcome for DQN-based agents; however, as PPO relies on a softmax parametrization to output actions (and learn) in a discrete space, it becomes more difficult for the agent to converge to such a policy, given its dependency on both entropy regularization and Boltzmann-based exploration [Ahmed et al., 2019, Mei et al., 2020].
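The contrast drawn here can be illustrated with a small, self-contained sketch. A greedy policy over action values is deterministic as soon as one value is the largest, whereas a softmax policy only approaches a deterministic choice as the gap between logits grows. The score values below are made up for illustration and are not taken from the trained agents.

```python
import math

ACTIONS = ["noop", "ET", "DG", "ET+DG"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(values):
    """Index of the largest action value, as a greedy value-based policy would pick."""
    return max(range(len(values)), key=lambda i: values[i])


for gap in (1.0, 2.0, 5.0, 10.0):
    scores = [gap, 0.0, 0.0, 0.0]  # hypothetical preference for noop
    p_noop = softmax(scores)[0]
    print(f"gap={gap:4.1f}  greedy={ACTIONS[greedy(scores)]}  softmax P(noop)={p_noop:.3f}")
```

The greedy choice is noop for every gap, while the softmax probability of noop only gets close to one for the largest gaps, which is the kind of convergence that entropy regularization works against.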

Figure 3: Sample trajectories of the four agents initialized from a fixed state with 3 reward functions. The PB reward function helps the agents solve the environment more consistently.

Figure 3 shows that agents converge to significantly different policies under different learning algorithms and reward functions. We note that for a significant part of the initial state space, the first few actions for all the agents that reach the green point are generally similar.

Figure 4: Sample cumulative relative share of actions taken by DQN agents trained with different reward functions.

In Figure 4, we see that agents that reach the green point prefer picking the action ET+DG during the first 23 steps of the episode, and then go on to exhibit more variety later in the episode. We notice a significant difference between the distributions of actions for the different reward functions. As expected, under the PC reward the agent maximises the number of noop actions taken in order to collect the highest reward. What is important to note is that the agent prefers to incur the cost of taking actions at the beginning of the episode. More generally, we see that the distribution of the first 50 actions taken is the same across all reward functions (75% ET+DG and 25% ET).
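One plausible way to compute a cumulative relative share of actions, as plotted in Figure 4, is sketched below. The exact definition used for the figure is not given in this section, so the running-fraction definition, the function name `cumulative_action_share` and the example episode are assumptions made for illustration.

```python
from collections import Counter

ACTIONS = ["noop", "ET", "DG", "ET+DG"]


def cumulative_action_share(actions):
    """For each step t, return the fraction of steps up to t spent on each action."""
    counts = Counter()
    shares = []
    for t, action in enumerate(actions, start=1):
        counts[action] += 1
        shares.append({a: counts[a] / t for a in ACTIONS})
    return shares


# Hypothetical episode: aggressive actions first, then noop.
episode = ["ET+DG", "ET+DG", "ET+DG", "ET", "noop", "noop", "noop", "noop"]
shares = cumulative_action_share(episode)
```

Each entry of `shares` sums to one, so plotting the four series against the step index shows how the mix of policies shifts over an episode.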

This implies that the environment has some form of natural bottleneck, which can also explain the loss in general performance for some of the agents in the sparse reward case, since agents don’t have a state- or dynamics-driven exploration system [Burda et al., 2018]. Ultimately, we see that the different reward functions yield qualitative changes in the resulting trained policies of the agents, with a generally larger difference for on-policy agents.

Figure 5: Average success rate of three seeds of the trained DQN agent for different reward functions given different initialisation states. Success is defined as reaching the green point.

Figure 5 shows how agents solve the environment from different initial states when trained with different reward functions. The experiments suggest that policies are indeed highly dependent on the employed learning algorithm, but we also see commonalities across reward functions, particularly when looking at initial states. If the agents are initialized in the top-left of the starting grid, it is much easier for them to reach the goal. On the other hand, if they start in the bottom-right corner, they seem guaranteed to fail. Therefore, when initialising a new episode, not all starts are equal in outcome. We hypothesize that a strictly optimal policy may not be able to do better, but further work is necessary to establish the exact failure conditions. We note that, under this model, we can explain this particular result by looking at the modeled dynamics of the environment: the economic output grows exponentially (as long as atmospheric carbon is low) and needs exponentially more energy to sustain itself. The underlying equations dictate that, if not enough renewable knowledge stock is available, this energy will come from fossil fuels. This in turn increases the excess atmospheric carbon, which accumulates very fast and causes the agent to cross the atmospheric carbon boundary. Such emergent behavior is also not immediately obvious when considering the system equations.

Conclusions

The question of whether algorithms could support policy making is of current interest: more than ever, governments have to make high-stakes decisions about fast-changing, complex, global and interconnected challenges, which are difficult to understand and tackle without relevant datasets, scientific evidence and scenario-analysis tools. Public policy making is often a cyclical process, with stages such as identification of societal needs; formulation of agendas, scenarios and policy alternatives; adoption of policy decisions; implementation in the real world; and finally, evaluation of their effectiveness, with subsequent improvements. Our experiments aim to understand whether RL can be used to formulate policy alternatives and evaluate their effectiveness in a simulation environment. Specifically, we have shown that a variety of RL algorithms produce well-behaved policies in the AYS environment under different, more or less sparse, reward functions. The combination of different reward functions, RL algorithms, and initial states produces diverse policies that can successfully explore and solve the underlying AYS model. This enables the study of the emergent properties of the system without needing to encode much knowledge about the model into the learning process. This is an interesting result, as it shows the potential of applying RL as a general debugging and analysis toolkit for IAMs.

We note that solving the AYS model relates to early action [MacCracken, 2008], as the agent solves the environment more consistently when implementing the more aggressive policy position early on in the episode. Early action holds that fast and aggressive climate change mitigation has cumulative benefits; the concept stems from the time value of carbon [Cornelis van Kooten et al., 2021]. Future work should explore whether agents can be trained across multiple model-environments, to understand whether some kind of “common exploration strategy” emerges as a result, or whether agents could be trained to explore small, simplified models and then behave in a reasonable manner in computationally bigger, and thus more expensive, IAMs. Appendix D presents further experiments, conclusions and future work.

References

  • Moore et al. [2022] Frances C. Moore, Katherine Lacasse, Katharine J. Mach, Yoon Ah Shin, Louis J. Gross, and Brian Beckage. Determinants of emissions pathways in the coupled climate–social system. Nature, 603(7899):103–111, 2 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-04423-8. URL https://www.nature.com/articles/s41586-022-04423-8.
  • Parson and Fisher-Vanden [1997] Edward A Parson and Karen Fisher-Vanden. Integrated assessment models of global climate change. Annual Review of Energy and the Environment, 22(1):589–628, 1997.
  • Pörtner et al. [2022] Hans-O Pörtner, Debra C Roberts, Helen Adams, Carolina Adler, Paulina Aldunce, Elham Ali, Rawshan Ara Begum, Richard Betts, Rachel Bezner Kerr, Robbert Biesbroek, et al. Climate change 2022: Impacts, adaptation and vulnerability. IPCC, Geneva, Switzerland, 2022.
  • Asefi-Najafabady et al. [2021] Salvi Asefi-Najafabady, Laura Villegas-Ortiz, and Jamie Morgan. The failure of integrated assessment models as a response to ‘climate emergency’ and ecological breakdown: the emperor has no clothes. Globalizations, 18(7):1178–1188, 2021.
  • Kittel et al. [2017] Tim Kittel, Finn Müller-Hansen, Rebekka Koch, Jobst Heitzig, Guillaume Deffuant, Jean Denis Mathias, and Jürgen Kurths. From lakes and glades to viability algorithms: Automatic classification of system states according to the Topology of Sustainable Management. European Physical Journal: Special Topics, 230(14-15):3133–3152, 6 2017. ISSN 19516401. doi: 10.48550/arxiv.1706.04542. URL https://arxiv.org/abs/1706.04542v4.
  • Nitzbon et al. [2017] Jan Nitzbon, Jobst Heitzig, and Ulrich Parlitz. Sustainability, collapse and oscillations in a simple World-Earth model. Environmental Research Letters, 12(7):074020, 7 2017. ISSN 1748-9326. doi: 10.1088/1748-9326/AA7581. URL https://iopscience.iop.org/article/10.1088/1748-9326/aa7581.
  • Strnad et al. [2019] Felix M. Strnad, Wolfram Barfuss, Jonathan F. Donges, and Jobst Heitzig. Deep reinforcement learning in World-Earth system models to discover sustainable management strategies. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(12):123122, 12 2019. ISSN 1054-1500. doi: 10.1063/1.5124673. URL https://aip.scitation.org/doi/abs/10.1063/1.5124673.
  • Sutton and Barto [2020] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. 2 edition, 2020. URL http://incompleteideas.net/book/the-book.html.
  • Rockström et al. [2009] Johan Rockström, Will Steffen, Kevin Noone, Asa Persson, F. Stuart Chapin, Eric F. Lambin, Timothy M. Lenton, Marten Scheffer, Carl Folke, Hans Joachim Schellnhuber, Björn Nykvist, Cynthia A. De Wit, Terry Hughes, Sander Van Der Leeuw, Henning Rodhe, Sverker Sörlin, Peter K. Snyder, Robert Costanza, Uno Svedin, Malin Falkenmark, Louise Karlberg, Robert W. Corell, Victoria J. Fabry, James Hansen, Brian Walker, Diana Liverman, Katherine Richardson, Paul Crutzen, and Jonathan A. Foley. A safe operating space for humanity. Nature, 461(7263):472–475, 9 2009. ISSN 1476-4687. doi: 10.1038/461472a. URL https://www.nature.com/articles/461472a.
  • Steffen et al. [2015] Will Steffen, Katherine Richardson, Johan Rockström, Sarah E. Cornell, Ingo Fetzer, Elena M. Bennett, Reinette Biggs, Stephen R. Carpenter, Wim De Vries, Cynthia A. De Wit, Carl Folke, Dieter Gerten, Jens Heinke, Georgina M. Mace, Linn M. Persson, Veerabhadran Ramanathan, Belinda Reyers, and Sverker Sörlin. Planetary boundaries: Guiding human development on a changing planet. Science, 347(6223), 2 2015. ISSN 10959203. doi: 10.1126/SCIENCE.1259855. URL http://dx.doi.
  • Keyßer and Lenzen [2021] Lorenz T Keyßer and Manfred Lenzen. 1.5 °C degrowth scenarios suggest the need for new mitigation pathways. Nature Communications, 12(1):2676, 2021.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, and Koray Kavukcuoglu. Asynchronous Methods for Deep Reinforcement Learning. 33rd International Conference on Machine Learning, ICML 2016, 4:2850–2869, 2 2016. URL https://arxiv.org/abs/1602.01783v2.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ArXiv, abs/1707.06347, 2017.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2 2015. ISSN 1476-4687. doi: 10.1038/nature14236. URL https://www.nature.com/articles/nature14236.
  • Hessel et al. [2017] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining Improvements in Deep Reinforcement Learning. 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pages 3215–3222, 10 2017. ISSN 2159-5399. doi: 10.48550/arxiv.1710.02298. URL https://arxiv.org/abs/1710.02298v1.
  • Ahmed et al. [2019] Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International conference on machine learning, pages 151–160. PMLR, 2019.
  • Mei et al. [2020] Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pages 6820–6829. PMLR, 2020.
  • Burda et al. [2018] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
  • MacCracken [2008] Michael C. MacCracken. Prospects for future climate change and the reasons for early action. Journal of the Air & Waste Management Association, 58(6):735–786, 2008. doi: 10.3155/1047-3289.58.6.735. URL https://doi.org/10.3155/1047-3289.58.6.735.
  • Cornelis van Kooten et al. [2021] G. Cornelis van Kooten, Patrick Withey, and Craig M.T. Johnston. Climate urgency and the timing of carbon fluxes. Biomass and Bioenergy, 151:106162, 2021. ISSN 0961-9534. doi: https://doi.org/10.1016/j.biombioe.2021.106162. URL https://www.sciencedirect.com/science/article/pii/S0961953421001987.
  • Bury et al. [2019] Thomas M. Bury, Chris T. Bauch, and Madhur Anand. Charting pathways to climate change mitigation in a coupled socio-climate model. PLOS Computational Biology, 15(6):e1007000, 6 2019. ISSN 1553-7358. doi: 10.1371/JOURNAL.PCBI.1007000. URL https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007000.
  • Wang et al. [2015] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling Network Architectures for Deep Reinforcement Learning. 33rd International Conference on Machine Learning, ICML 2016, 4:2939–2947, 11 2015. doi: 10.48550/arxiv.1511.06581. URL https://arxiv.org/abs/1511.06581v3.
  • Lundberg and Lee [2017] Scott M Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. In I Guyon, U Von Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.
  • Juozapaitis et al. [2019] Zoe Juozapaitis, Anurag Koul, Alan Fern, Martin Erwig, and Finale Doshi-Velez. Explainable Reinforcement Learning via Reward Decomposition. 2019. URL https://web.engr.oregonstate.edu/~erwig/papers/ExplainableRL_XAI19.pdf.
  • van der Waa et al. [2018] Jasper van der Waa, Jurriaan van Diggelen, Karel van den Bosch, and Mark Neerincx. Contrastive Explanations for Reinforcement Learning in terms of Expected Consequences. 7 2018. doi: 10.48550/arxiv.1807.08706. URL https://arxiv.org/abs/1807.08706v1.

Appendix A A broader look at the AYS environment

Figure 6: Schematic from Kittel et al. [2017] summarizing the interactions in the AYS model. The full lines are positive interactions and the dotted lines are negative interactions.

A.1 Observables

The model has three observed variables:

  • Excess atmospheric carbon, $A$, in gigatonnes of carbon (GtC).

  • Economic output, $Y$, in US dollars per year (\$/yr).

  • Renewable knowledge stock, $S$, in gigajoules (GJ).

This low-dimensional environment enables us to test our framework’s limits as well as plot trajectories in phase space for tractable interpretability and analysis. Furthermore, the dynamics of the model are kept relatively simple compared to other, more complex models [Nitzbon et al., 2017, Bury et al., 2019, Moore et al., 2022]. We note that, of the three variables, the last one is less "tangible" than the others, as it represents a measure of the knowledge humans have about renewable energy; to make it relatable to available quantitative data, this model chooses to represent it as energy.

A.2 Equations

The system is governed by three differential equations, one for each observed variable:

$$\frac{dA}{dt} = E - A/\tau_{A}, \tag{3}$$
$$\frac{dY}{dt} = \beta Y - \theta AY, \tag{4}$$
$$\frac{dS}{dt} = R - S/\tau_{S}. \tag{5}$$

Here $R$ is the energy extracted from renewables and $E$ the carbon emitted by burning fossil fuels. These are defined in terms of the energy demand $U$ in GJ/year, which is proportional to the economic output:

$$U = \frac{Y}{\epsilon}, \tag{6}$$

where $\epsilon$ is the energy efficiency. Energy is produced either from renewable sources or from fossil fuel sources:

$$R = (1-\Gamma)U, \tag{7}$$
$$F = \Gamma U, \tag{8}$$
$$E = F/\phi. \tag{9}$$

Here, $F$ is the energy extracted from fossil fuels and $\phi$ is the fossil fuel combustion efficiency in GJ/GtC. The share of fossil fuel energy $\Gamma$ is calculated as an inverse response to the renewable knowledge:

$$\Gamma = \frac{1}{1+(S/\sigma)^{\rho}}, \tag{10}$$

where $\sigma$ is the break-even knowledge, which corresponds to the state where renewable and fossil fuel costs become equal, and $\rho$ is the renewable knowledge learning rate. As knowledge on renewables ($S$) increases, $\Gamma\to 0$ and the share of energy produced by renewables increases. If $S\to 0$, then $\Gamma\to 1$ and more energy is produced from fossil fuels. The interactions are summarized in Figure 6. The parameter values are summarized in Table 2.
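As a consistency check, Equations (3)-(10) can be combined to recover the compact form given in Equation (1). The short derivation below only substitutes the definitions above and introduces no new symbols; Equation (4) is already in its final form.

```latex
\begin{align*}
\frac{dA}{dt} &= E - \frac{A}{\tau_A}
  = \frac{F}{\phi} - \frac{A}{\tau_A}
  = \frac{\Gamma U}{\phi} - \frac{A}{\tau_A}
  = \frac{1}{1+(S/\sigma)^{\rho}}\,\frac{Y}{\phi\epsilon} - \frac{A}{\tau_A}, \\
\frac{dS}{dt} &= R - \frac{S}{\tau_S}
  = (1-\Gamma)\,U - \frac{S}{\tau_S}
  = \left(1-\frac{1}{1+(S/\sigma)^{\rho}}\right)\frac{Y}{\epsilon} - \frac{S}{\tau_S}.
\end{align*}
```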

Table 2: Parameters of the AYS model from Kittel et al. [2017].

| Parameter | Value | Description |
|---|---|---|
| $\tau_A$ | 50 years | Carbon decay out of the atmosphere. |
| $\tau_S$ | 50 years | Decay of renewable knowledge stock. |
| $\beta$ | 3 %/year | Economic output growth. |
| $\sigma$ | $4 \times 10^{12}$ GJ | Break-even knowledge: knowledge at which fossil fuel and renewables have equal cost. |
| $\phi$ | $4.7 \times 10^{10}$ GJ/GtC | Fossil fuel combustion efficiency. |
| $\epsilon$ | 147 \$/GJ | Energy efficiency parameter. |
| $\theta$ | $8.57 \times 10^{-5}$ | Temperature sensitivity parameter. |
| $\rho$ | 2 | Learning rate of renewable knowledge. |
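
The energy split in equations (7) to (10) can be illustrated with a minimal Python sketch that uses the values of $\sigma$, $\rho$ and $\phi$ from Table 2. The function names are hypothetical, and the energy demand $U$ is simply taken as an input here; the demand value in the usage line is chosen only for illustration.

```python
# Parameter values copied from Table 2.
SIGMA = 4e12   # break-even knowledge, GJ
RHO = 2.0      # learning rate of renewable knowledge
PHI = 4.7e10   # fossil fuel combustion efficiency, GJ/GtC


def fossil_share(S, sigma=SIGMA, rho=RHO):
    """Equation (10): share of energy coming from fossil fuels."""
    return 1.0 / (1.0 + (S / sigma) ** rho)


def energy_split(U, S):
    """Equations (7)-(9): split the energy demand U given the knowledge stock S.

    Returns renewable energy R, fossil energy F and emissions E.
    """
    gamma = fossil_share(S)
    R = (1.0 - gamma) * U
    F = gamma * U
    E = F / PHI
    return R, F, E


# At S = sigma the two energy sources share the demand equally.
print(fossil_share(SIGMA))  # 0.5
# Hypothetical demand value, chosen only for illustration.
print(energy_split(U=4.7e11, S=SIGMA))
```

At the initial knowledge stock of $5 \times 10^{11}$ GJ, the same function gives a fossil share of roughly 0.98, which reflects how far the initial state sits below the break-even knowledge.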

A.3 States

The initial state is:

$$s_{t=0} = \begin{pmatrix} 240\,\mathrm{GtC} \\ 7 \times 10^{13}\,\$/\mathrm{yr} \\ 5 \times 10^{11}\,\mathrm{GJ} \end{pmatrix}, \tag{14}$$

which aims to represent the current state of the Earth in this model. There are two attractors in this model. The first is

$$s_b = \begin{pmatrix} \beta/\theta \\ \frac{\phi\beta\epsilon}{\theta\tau_A} \\ 0 \end{pmatrix} = \begin{pmatrix} 350\,\mathrm{GtC} \\ 4.84 \times 10^{13}\,\$/\mathrm{yr} \\ 0\,\mathrm{GJ} \end{pmatrix}. \tag{21}$$

This is denoted as the black fixed point: economic production stagnates at about 70% of its current value ($4.84 \times 10^{13}\,\$/\mathrm{yr}$ versus $7 \times 10^{13}\,\$/\mathrm{yr}$). Furthermore, at the black fixed point there is an excess of 350 GtC in the atmosphere and no renewable energy production. The other point we are interested in is located at the boundary of the state space,

$$s_g = \begin{pmatrix} 0 \\ +\infty \\ +\infty \end{pmatrix}, \tag{25}$$

where economic output and renewable energy knowledge grow forever. We label this the green fixed point, and we note that it corresponds to the ideal scenario.

The dynamics of this environment do not allow for more fixed points. Therefore, any point in the space will be naturally drawn to one of these fixed points.
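
As a quick consistency check, the closed-form expression for the black fixed point in equation (21) can be evaluated with the parameter values from Table 2. This is only a sketch: the function name is hypothetical, and $\beta$ is written as a fraction per year (0.03) rather than as a percentage.

```python
# Parameter values copied from Table 2 (beta written as a fraction per year).
BETA = 0.03      # economic output growth, 1/yr
THETA = 8.57e-5  # temperature sensitivity parameter
PHI = 4.7e10     # fossil fuel combustion efficiency, GJ/GtC
EPSILON = 147.0  # energy efficiency parameter, $/GJ
TAU_A = 50.0     # carbon decay time, years


def black_fixed_point():
    """Closed-form black fixed point (A, Y, S) from equation (21)."""
    A = BETA / THETA
    Y = PHI * BETA * EPSILON / (THETA * TAU_A)
    S = 0.0
    return A, Y, S


A_b, Y_b, S_b = black_fixed_point()
print(f"A = {A_b:.0f} GtC, Y = {Y_b:.3g} $/yr, S = {S_b:.0f} GJ")
```

The printed values agree with the rounded numbers on the right-hand side of equation (21).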

Just as in Kittel et al. [2017], we normalize the state variables $A$, $Y$ and $S$ between 0 and 1. This prevents any numerical issues from arising in unexpected ways. The normalization scheme employed is the following:

$$\bar{s_t} = \frac{s_t}{s_t + s_{t=0}}. \tag{26}$$

This leads to the initial state being $(0.5, 0.5, 0.5)^T$ and the green fixed point being $s_g = (0, 1, 1)^T$.
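
The normalization in equation (26) and its inverse can be sketched in a few lines of Python. The function names are hypothetical; the reference state is the initial state from equation (14).

```python
# Initial (current-day) state from equation (14): A in GtC, Y in $/yr, S in GJ.
S0 = (240.0, 7e13, 5e11)


def normalize(state, reference=S0):
    """Equation (26): map each non-negative variable into [0, 1)."""
    return tuple(x / (x + x0) for x, x0 in zip(state, reference))


def denormalize(norm_state, reference=S0):
    """Inverse of equation (26), valid for normalized values below 1."""
    return tuple(z * x0 / (1.0 - z) for z, x0 in zip(norm_state, reference))


print(normalize(S0))                     # (0.5, 0.5, 0.5)
print(normalize((350.0, 4.84e13, 0.0)))  # black fixed point, normalized
```

A variable that grows without bound approaches 1 under this map, which is why the green fixed point sits at $(0, 1, 1)^T$ in normalized coordinates.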

A.4 Episode Description

The AYS model is a deterministic environment. To enable non-trivial learning, we initialize each new episode in a random state sampled from a fixed distribution:

$$s_{t=0} = \begin{pmatrix} 0.5 + \mathcal{U}(-0.05, 0.05) \\ 0.5 + \mathcal{U}(-0.05, 0.05) \\ 0.5 \end{pmatrix}, \tag{30}$$

where $\mathcal{U}$ is the uniform distribution. We do not add noise to the third variable, as we notice that it dramatically reduces the agent's ability to learn [Strnad et al., 2019].
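
Sampling the initial state of equation (30) might look as follows in Python. The function name and the seed are hypothetical choices made for illustration.

```python
import random


def sample_initial_state(rng=random):
    """Equation (30): perturb A and Y uniformly, keep S fixed at 0.5."""
    a = 0.5 + rng.uniform(-0.05, 0.05)
    y = 0.5 + rng.uniform(-0.05, 0.05)
    s = 0.5  # no noise on the renewable knowledge stock
    return (a, y, s)


rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
print(sample_initial_state(rng))
```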

Each step corresponds to a difference of 1 year. The environment uses an Ordinary Differential Equation numerical solver to calculate the state for the next step given the action-specific parameters.
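
A one-year environment step could be sketched as below. Two assumptions are made and should be read as such. First, the right-hand side `ays_rhs` follows our reading of the AYS dynamics in Kittel et al. [2017], with an energy demand of $U = Y/\epsilon$; the model equations given earlier in this appendix are authoritative. Second, a hand-written classical Runge-Kutta integrator stands in for whichever ODE solver the environment actually uses. Action-specific parameter changes are left out.

```python
# Table 2 parameters; BETA is 3 %/year written as a fraction.
TAU_A, TAU_S = 50.0, 50.0
BETA, SIGMA, PHI = 0.03, 4e12, 4.7e10
EPSILON, THETA, RHO = 147.0, 8.57e-5, 2.0


def ays_rhs(state):
    """ASSUMED right-hand side of the AYS model (after Kittel et al. [2017])."""
    A, Y, S = state
    U = Y / EPSILON                           # assumed energy demand
    gamma = 1.0 / (1.0 + (S / SIGMA) ** RHO)  # equation (10)
    R, F = (1.0 - gamma) * U, gamma * U       # equations (7) and (8)
    E = F / PHI                               # equation (9)
    return (E - A / TAU_A, BETA * Y - THETA * A * Y, R - S / TAU_S)


def rk4_step(state, dt):
    """One classical fourth-order Runge-Kutta step of size dt (years)."""
    def shifted(k, h):
        return tuple(x + h * kx for x, kx in zip(state, k))

    k1 = ays_rhs(state)
    k2 = ays_rhs(shifted(k1, dt / 2))
    k3 = ays_rhs(shifted(k2, dt / 2))
    k4 = ays_rhs(shifted(k3, dt))
    return tuple(
        x + dt / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def step_one_year(state, substeps=10):
    """Advance the unnormalized state by one year, i.e. one environment step."""
    for _ in range(substeps):
        state = rk4_step(state, 1.0 / substeps)
    return state


print(step_one_year((240.0, 7e13, 5e11)))
```

Under these assumed dynamics the black fixed point of equation (21) is stationary, which gives a simple sanity check on the right-hand side.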

Appendix B Neural Network Architecture

We attempt to keep the network architecture the same across all our experiments and agents. The "torso" of the policy consists of three linear layers interleaved with ReLU activation functions. The input layer has three units, the hidden layer has 256 units and the output layer has four units. In the case of D3QN, we use a dueling network [Wang et al., 2015] such that there are two output layers connected to the hidden layer: one with four units and one with a single unit.
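
The dueling output can be illustrated with a small framework-free sketch. The layer sizes below (3 to 256, 256 to 256, then the two heads) are one reading of the description above and are an assumption, as is the weight initialization. The aggregation $Q(s,a) = V(s) + A(s,a) - \mathrm{mean}_a A(s,a)$ is the standard one from Wang et al. [2015]; the text does not state which variant the agents use. A practical implementation would rely on a deep learning framework rather than plain Python lists.

```python
import random


def make_linear(n_in, n_out, rng):
    """Random weights and zero biases for one linear layer (illustrative init)."""
    bound = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply a linear layer: weights @ x + biases."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


rng = random.Random(0)
torso = [make_linear(3, 256, rng), make_linear(256, 256, rng)]  # assumed sizes
advantage_head = make_linear(256, 4, rng)  # one output per action
value_head = make_linear(256, 1, rng)      # single state value


def dueling_q_values(state):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), as in Wang et al. [2015]."""
    h = state
    for layer in torso:
        h = relu(linear(h, layer))
    adv = linear(h, advantage_head)
    value = linear(h, value_head)[0]
    mean_adv = sum(adv) / len(adv)
    return [value + a - mean_adv for a in adv]


q = dueling_q_values([0.5, 0.5, 0.5])
print(len(q))  # 4
```

Subtracting the mean advantage makes the average of the four Q-values equal to the state value, which keeps the two heads identifiable.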

Appendix C Action distribution by reward function

We observe how different reward functions produce significantly different types of trajectories (Figure 5).

Figure 7: Reward obtained and action taken per timestep by different successful DQN agents trained with different reward functions.

This shows that there are many qualitatively different pathways towards achieving the goal in the AYS environment, and that reward signals can easily be used as a way to embed structure into policy space. However, finding a strategy for precisely tuning the trajectories such that they may "evolve" in some specific manner is still an open problem, and we believe it to be robust. Nonetheless, the effect is noticeable across all our experiments.

Appendix D Further experiments and conclusions

In our experiments we observed significant sensitivity to hyperparameters, which were very difficult to tune. The off-policy agents adapted to the environment much more readily, as shown by their consistent performance across experiments. The on-policy agents explored too little, which significantly hurt their performance when using the cost reward function.

Figure 8: Moving average and standard deviation of the reward of the agents in the Noisy AYS environment. The noise standard deviation is set at $10^{-3}$.

Comparing AYS to noisy AYS environment

We now test injecting Gaussian noise into the parameters of the environment at the start of each new episode, so that each episode is slightly different. This is significant because in dynamical systems small changes early on can radically change the outcome, a behaviour known in mathematics as chaos. The test probes the agents’ robustness to different environment parameters, emulating the fact that real-world parameters are never perfectly known. The PPO and A2C agents struggle significantly more in this environment, whereas the DQN-based agents are relatively unaffected by the introduction of noise. This is promising, as it shows that off-policy agents can learn in this noisy environment, which brings them one step closer to real-world application.
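
One way to draw a fresh parameter set per episode is sketched below. The text does not say how the noise is scaled, so this sketch assumes relative noise: each Table 2 parameter is multiplied by one plus a Gaussian draw with standard deviation $10^{-3}$, the value quoted in the caption of Figure 8. The function name and the seed are hypothetical.

```python
import random

# Nominal parameter values from Table 2 (beta as a fraction per year).
NOMINAL = {
    "tau_A": 50.0, "tau_S": 50.0, "beta": 0.03, "sigma": 4e12,
    "phi": 4.7e10, "epsilon": 147.0, "theta": 8.57e-5, "rho": 2.0,
}


def sample_noisy_parameters(rng, noise_std=1e-3):
    """Draw one parameter set for a new episode.

    ASSUMPTION: the noise is relative, i.e. each parameter is multiplied by
    (1 + Gaussian noise). The scaling actually used is not stated in the text.
    """
    return {name: value * (1.0 + rng.gauss(0.0, noise_std))
            for name, value in NOMINAL.items()}


rng = random.Random(0)  # seed chosen arbitrarily for reproducibility
episode_params = sample_noisy_parameters(rng)
for name in NOMINAL:
    print(name, NOMINAL[name], episode_params[name])
```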

Comparing fully vs partially observable environments

We also test making the environment fully observable to the agent by providing the velocities of the variables as additional observable features. This enables us to show whether a partially observed environment is a significant hurdle to learning, as the real world is almost certainly only partially observed. We find that some of the agents can still learn equivalently well in a partially observed environment if fine-tuned to it (the agents’ hyperparameters were optimised for the partially observed environment). We can also contrast how the agent leverages the information from each observable in both environments by analysing the trained neural network parameters. We use SHAP values [Lundberg and Lee, 2017] as a proxy for feature importance; see Figure 10(b) below. We see that the state velocity is used more by the agent to infer expected reward.
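
A possible way to build the augmented observation is sketched below. It assumes the velocities are approximated by finite differences between consecutive states; the exact derivatives from the model equations could be used instead, and the text does not specify which is done. The function name is hypothetical.

```python
def observation_with_velocity(prev_state, state, dt=1.0):
    """Concatenate the state with a finite-difference estimate of its velocity.

    ASSUMPTION: velocities are approximated as (state - prev_state) / dt,
    with dt equal to one environment step (one year).
    """
    velocity = tuple((x - p) / dt for x, p in zip(state, prev_state))
    return tuple(state) + velocity


prev = (0.50, 0.50, 0.50)
curr = (0.51, 0.50, 0.49)
obs = observation_with_velocity(prev, curr)
print(len(obs))  # 6
```

With this construction the network input grows from three to six features, so the input layer of the agent has to be widened accordingly.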

Figure 9: Moving average and standard deviation of the reward of the four agents with a partially observed environment and with a fully observed one (labelled Markov), all trained over 500,000 time steps. It is worth noting that the hyperparameters were not re-tuned for the fully observable environment, so that we compare identical agents in different environments.
(a) D3QN
(b) Markov D3QN
Figure 10: SHAP values of two D3QN agents, one trained in the partially observed environment and one trained in the fully observed environment. These plots use 500 states randomly sampled from the replay buffer after training.

Future extensions

There are many extensions that have not been explored in this work. One is the number of actions that can be taken per year: we set this to one throughout, for no particular reason other than the easily interpretable idea of one policy per year. In this work, we focused on interpretability and thus aimed to leave the fundamental dynamics of the model from Kittel et al. [2017] untouched. Additional actions or continuous actions are a clear avenue for probing the environment in different ways. Research in Explainable Artificial Intelligence (XAI) could also be integrated into this framework, specifically explainable RL [Juozapaitis et al., 2019, van der Waa et al., 2018], which would help with explaining the agents and interpreting their decisions. Multi-agent RL may also be promising, simulating the different drivers of different nations through differentiated reward functions.
