License: arXiv.org perpetual non-exclusive license
arXiv:2211.03464v2 [quant-ph] 08 Mar 2024

A Survey on Quantum Reinforcement Learning

Nico Meyer, Christian Ufrecht, Maniraman Periyasamy, Daniel D. Scherer, Axel Plinge, and Christopher Mutschler
Fraunhofer IIS
Fraunhofer Institute for Integrated Circuits IIS, Nuremberg, Germany
{firstname.lastname|daniel.scherer2}@iis.fraunhofer.de
(January 1, 2024)
Abstract

Quantum reinforcement learning is an emerging field at the intersection of quantum computing and machine learning. While we intend to provide a broad overview of the literature on quantum reinforcement learning (our interpretation of this term is clarified below), we put particular emphasis on recent developments. With a focus on already available noisy intermediate-scale quantum devices, these include variational quantum circuits acting as function approximators in an otherwise classical reinforcement learning setting. In addition, we survey quantum reinforcement learning algorithms based on future fault-tolerant hardware, some of which come with a provable quantum advantage. We provide both a bird's-eye view of the field and summaries and reviews of selected parts of the literature.

1 Introduction and Overview

With recent advances in the fabrication and control of hardware for quantum information processing, the possibilities of merging quantum computing (QC) with machine learning (ML) have received a huge amount of attention within the growing research community. Within ML, reinforcement learning (RL) is the third paradigm besides supervised and unsupervised learning. In this survey article, we provide an overview of so-called quantum reinforcement learning (QRL) algorithms. We understand these as quantum-assisted approaches that solve a particular task (be it classical or quantum in nature) by employing quantum resources (in simulation and/or in experiment).

In order to keep this contribution as self-contained as possible, we provide the necessary background before venturing into the QRL literature. We start out with a brief recap of the essentials of the RL paradigm in the fully classical setting in Sec. 2. Further, in Sec. 3 we provide a quick introduction to QC and variational quantum circuits (VQCs). Readers familiar with either of the topics may safely skip the corresponding section.

Figure 1: A possible classification matrix for QRL algorithms, where we took into account only those variants of QRL which we focus on in Sec. 4. The algorithm classes are ordered according to their degree of quantum-classical hybridization, ranging from purely classical to purely quantum. A more detailed review of the 22 selected works on QiRL algorithms can be found in Sec. 4.1. VQC-based approaches, comprising 68 papers, are summarized in detail in Sec. 4.2. QRL algorithms employing post-NISQ quantum algorithms as subroutines or even fully quantum approaches to QRL are described in Sec. 4.3, Sec. 4.4, Sec. 4.5 and Sec. 4.6, based on 30 selected manuscripts. The dashed vertical line between classical and NISQ compute resources indicates that presently it is unclear whether QRL with NISQ-compatible algorithms offers robust quantum advantage on a broad range of learning problems. The solid vertical line distinguishes post-NISQ algorithms from both classical and NISQ-compatible algorithms, as they typically come with guaranteed quantum advantage (at least relative to their classical counterparts).

In Sec. 4 we turn our attention to the emerging field of QRL, starting out with a quick overview of the literature. Then we delve into summaries of the most prominent contributions. This selection is necessarily subjective and reflects our own research interests; overall we identified 177 relevant manuscripts, of which we reviewed 120 explicitly. For a detailed overview of paper counts, see the table below. We organized our summaries into several blocks that are ordered by what one could call an increasing degree of ‘quantization’. The first of these blocks in Sec. 4.1 covers what we refer to as ‘quantum-inspired’ RL algorithms. The second block in Sec. 4.2 takes a rather detailed look at QRL algorithms that employ so-called VQCs as function approximators. In many cases, the corresponding algorithms are obtained by simply replacing a standard neural network function approximator (or any other sort) by an appropriate VQC. We provide detailed summaries for most papers in this category, as variational quantum algorithms are believed to offer the potential to obtain quantum advantage despite the limitations of present-day NISQ hardware. In Secs. 4.3 and 4.4, we take a look at realizations of QRL based on so-called projective simulation and the use of Boltzmann machines as function approximators, respectively. In Sec. 4.5 we move to a class of approaches that employ quantum algorithms as subroutines. The corresponding hardware requirements will likely be compatible only with universal, fault-tolerant and error-corrected quantum processing units (QPUs). Finally, Sec. 4.6 provides a summary of a formal approach to QRL, which treats all components of RL ‘quantumly’. From our point of view, the highest degree of quantization can thus be found in these approaches. Fig. 1 gives an overview of the QRL literature as understood in this survey.

Category                ≤2018   2019   2020   2021   2022   2023      Σ
Quantum-inspired QRL       12      1      2      5      2      0     22
VQC-based QRL               0      2      2      9     21     34     68
QRL application^a           0      0      0      1      7     18     26
Post-NISQ QRL              12      2      2      6      4      4     30

Finally, in Sec. 5 we state our concluding thoughts on the current state-of-the-art of QRL. Before moving to more technical content, we would like to express our hope that this literature survey on QRL will be of use to colleagues and collaborators and the wider QC research community. It represents our effort to familiarize ourselves with QRL and its main research directions.

2 Classical Reinforcement Learning

Compared to the methods of supervised and unsupervised learning, which are typically implemented as passive learning, RL falls into the class of interaction-based learning [SB18]. On an abstract level, the learner interacts with its environment, the state of which it can either fully or only partially observe through a corresponding observation obtained after executing an action according to an underlying policy. In the RL paradigm, the learner is therefore appropriately referred to as an agent: it can - be it in simulation or in the real world - interact with its environment according to its abilities. The aim of RL is to learn a policy through the interaction of the agent with the environment, which is optimal with regard to a reward adapted to the problem. In other words, the agent should find an optimal policy during the learning process in the abstract space of all policies, which maximizes the expected cumulative reward. The theoretical basis for RL is formed by so-called Markov decision processes (MDPs) and the associated Bellman equation, which represents a consistency equation for the so-called value function. In turn, an optimal policy can be extracted from the optimal value function. Alternatively, the optimal policy can also be learned directly. Under certain conditions, the elements of RL can be mapped to their respective equivalents in control theory, where typically a dynamic optimization problem is solved by gradient-based methods with simulation of the corresponding model dynamics. On the RL side, there are both model-based and model-free approaches. The model-free approach in particular is one of the strengths of the RL method, since in many cases state and action spaces are too high-dimensional to design realistic dynamical models and simulate them to efficiently find optimal control strategies. The large dimensions of the spaces that occur in realistic problems make the use of approximation methods for the value function necessary. 
Driven by the breakthroughs in deep learning (DL), artificial neural networks (NNs) have established themselves as function approximators for both value function and policy (understood as a deterministic or probabilistic mapping of states to actions), thus establishing the field of deep reinforcement learning (DRL).

In the following, we introduce the various notions pertaining to RL in a more formal way and provide the background necessary to understand the basic RL terminology. In an RL scenario, the algorithm, also referred to as the agent, generates its own data by interacting with an environment. This interaction happens over discrete timesteps $t$, which are accumulated into episodes with either finite or infinite horizon. In each timestep, the agent makes an observation $s_t \in \mathcal{S}$ of the environment. Based on this state information, an action $a_t \in \mathcal{A}$ acting on the environment is selected according to a policy. Based on the (usually unknown) environment dynamics, the next state $s_{t+1} \in \mathcal{S}$ is observed from the environment and the agent receives a reward $r_t \in \mathcal{R}$ for its choice. The agent should select its actions in such a way that some objective is optimized, usually related to the long-term reward. A sketch of this pipeline can be found in Fig. 2. In this survey article, we follow the formalism and notation of Sutton and Barto [SB18], with small adaptations wherever we feel that this eases comprehension.

Figure 2: Interaction between agent and environment for one timestep of an RL task.

Reinforcement Learning as a Markov Decision Process

More formally, this setup is usually described as an MDP. A finite MDP is a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, p, \gamma)$, where the sets $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are finite. It is defined by the following components:

  • A set of states $\mathcal{S}$ the agent can observe from the environment.

  • A set of actions $\mathcal{A}$ the agent can execute in the environment.

  • A set of rewards $\mathcal{R} \subset \mathbb{R}$ the agent can receive from the environment.

  • The environment dynamics $p: \mathcal{S} \times \mathcal{R} \times \mathcal{A} \times \mathcal{S} \to [0,1]$. The value $p(s', r | s, a) := \text{Pr}\{s_{t+1} = s', r_t = r | s_t = s, a_t = a\}$ gives the probability that the environment transitions to state $s_{t+1}$ and the agent receives reward $r_t$, if the agent executes action $a_t$ in state $s_t$ at time $t$.

  • The discount factor $0 \leq \gamma \leq 1$ (more on this below).

The dynamics of the environment are often not accessible to the agent; otherwise the task reduces to (not necessarily trivial) dynamic programming. Since $\mathcal{S}$ and $\mathcal{R}$ are finite, the function $p$ is a conditional probability mass function, i.e., $\sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r | s, a) = 1$ holds for all choices of $s \in \mathcal{S}$ and $a \in \mathcal{A}$. According to the Markov property, the dynamics are completely described by $p$, i.e., the consecutive state $s_{t+1}$ and reward $r_t$ depend solely on the directly preceding state $s_t$ and action $a_t$.
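As a minimal sketch of the definition above, the dynamics of a small finite MDP can be stored in a dictionary and checked for normalization. The two-state environment, its transition probabilities and rewards, and the names `P` and `check_dynamics` are hypothetical and chosen purely for illustration.

```python
# Hypothetical two-state, two-action MDP; all names and numbers are illustrative.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
GAMMA = 0.9  # discount factor

# P[(s, a)] maps an outcome (s_next, r) to the probability p(s_next, r | s, a).
P = {
    ("s0", "stay"): {("s0", 0.0): 1.0},
    ("s0", "move"): {("s1", 1.0): 0.8, ("s0", 0.0): 0.2},
    ("s1", "stay"): {("s1", 0.5): 1.0},
    ("s1", "move"): {("s0", 0.0): 1.0},
}


def check_dynamics(P, states, actions, tol=1e-12):
    """Return True if p(., . | s, a) sums to one for every state-action pair."""
    for s in states:
        for a in actions:
            total = sum(P[(s, a)].values())
            if abs(total - 1.0) > tol:
                return False
    return True


print(check_dynamics(P, STATES, ACTIONS))
```

The set of rewards is implicit here: it consists of all reward values that appear in some outcome of `P`.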

With this framework in mind, the interaction between agent and environment can be described as a trajectory $\tau$. For a finite or infinite horizon $H$, one episode is therefore given by the sequence

$$\tau = \left[s_0, a_0, r_0, s_1, a_1, r_1, s_2, \cdots, s_{H-1}, a_{H-1}, r_{H-1}\right], \tag{1}$$

with $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$, and $r_t$ sampled following the environment dynamics for each timestep $t$.
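The sampling of a trajectory can be sketched as follows. This is an illustrative rollout under assumed toy dynamics; the table `P`, the placeholder `uniform_policy`, and the function `sample_trajectory` are hypothetical names that do not appear in the surveyed literature.

```python
import random

# Hypothetical toy dynamics: P[(s, a)] lists ((s_next, r), probability) pairs.
P = {
    (0, 0): [((0, 0.0), 1.0)],
    (0, 1): [((1, 1.0), 0.8), ((0, 0.0), 0.2)],
    (1, 0): [((1, 0.5), 1.0)],
    (1, 1): [((0, 0.0), 1.0)],
}


def uniform_policy(state, rng):
    """Placeholder policy: pick one of the two actions uniformly at random."""
    return rng.choice([0, 1])


def sample_trajectory(P, policy, s0, horizon, rng):
    """Roll out one episode and return [s_0, a_0, r_0, ..., s_{H-1}, a_{H-1}, r_{H-1}]."""
    tau, s = [], s0
    for _ in range(horizon):
        a = policy(s, rng)
        outcomes, probs = zip(*P[(s, a)])
        # Sample (s_{t+1}, r_t) from the environment dynamics p(., . | s_t, a_t).
        s_next, r = rng.choices(outcomes, weights=probs, k=1)[0]
        tau.extend([s, a, r])
        s = s_next
    return tau


rng = random.Random(0)
tau = sample_trajectory(P, uniform_policy, s0=0, horizon=5, rng=rng)
print(len(tau))  # three entries (state, action, reward) per timestep
```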

Long Term Reward as Objective

The agent gets feedback from the environment through the immediate rewards $r_t$. However, instead of maximizing these short-term rewards, it is much more appropriate to use some long-term measure as the objective. A natural choice is the cumulative reward, also referred to as the return

$$G_t := r_t + r_{t+1} + r_{t+2} + \cdots + r_{H-1}. \tag{2}$$

For episodic tasks ($H < \infty$) it is often desirable, and for continuing tasks ($H = \infty$) it is necessary, to use a discount factor $\gamma$. This leads to the discounted return

$$G_t := \sum_{t'=t}^{H-1} \gamma^{t'-t} \cdot r_{t'}, \tag{3}$$

where each choice of $\gamma$ defines a different MDP. For $\gamma < 1$ the value of $G_t$ is guaranteed to be finite, and the emphasis on individual rewards decreases with their distance from the current timestep. For $\gamma = 0$ the sum reduces to just the immediate reward, so an appropriate choice of this hyperparameter is crucial for the potential success of the RL agent.
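The effect of the discount factor can be illustrated with a short computation of the return for a finite list of rewards. The helper `discounted_return` and the reward values are hypothetical; the function uses the recursion $G_t = r_t + \gamma \cdot G_{t+1}$, which is equivalent to the sum above.

```python
def discounted_return(rewards, gamma, t=0):
    """G_t = sum_{t'=t}^{H-1} gamma^(t'-t) * r_{t'} for a finite reward list."""
    g = 0.0
    # Accumulate backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # illustrative rewards r_0, r_1, r_2
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(discounted_return(rewards, gamma=0.0))  # immediate reward only: 1.0
print(discounted_return(rewards, gamma=1.0))  # undiscounted sum: 3.0
```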

Policy, Value Functions and Optimality

In order to describe a meaningful RL setup, some concepts are still missing. As described above, the agent needs to decide on an action in every timestep, depending on the observed state information. This decision-making process can be understood as a (stochastic) policy

$$\pi(a|s) := \text{Pr}\{a_t = a | s_t = s\}, \tag{4}$$

where $\sum_{a \in \mathcal{A}} \pi(a|s) = 1$ holds for all $s \in \mathcal{S}$. The overall task of RL is to derive an optimal policy $\pi^*$ w.r.t. some metric.

A suitable tool to define optimality and also to simplify updates is the notion of value functions. The state value function of state $s$ under the current policy $\pi$ is defined as

$$V_\pi(s) := \mathbb{E}_\pi\left[G_t | s_t = s\right]. \tag{5}$$

It describes the expected returns when starting in state $s$ and following policy $\pi$ from there on, with the value for a terminal state always zero by definition. It can be interpreted as a measure of how good it is to be in a certain state, where quality is measured w.r.t. expected return. Explicitly separating the first step in the definition above gives rise to the Bellman (expectation) equation

$$V_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r | s, a) \left[r + \gamma \cdot V_\pi(s')\right], \tag{6}$$

for all $s \in \mathcal{S}$. Consequently, the value function $V_\pi$ can be viewed as the unique solution to this Bellman equation. Alternatively, one can define the state-action value function as the expected return when starting in state $s$, executing action $a$, and following policy $\pi$ from there on. It is defined as

$$Q_\pi(s,a) := \mathbb{E}_\pi\left[G_t | s_t = s, a_t = a\right], \tag{7}$$

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. It is straightforward to see that $V_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q_\pi(s,a)$ holds for all $s \in \mathcal{S}$. This identity can be used to give the Bellman equation for the state-action value function as $Q_\pi(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r | s, a) \left[r + \gamma \cdot \sum_{a' \in \mathcal{A}} \pi(a'|s') Q_\pi(s',a')\right]$.
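The Bellman expectation equation can be turned into a simple fixed-point iteration, commonly called iterative policy evaluation. The following is a minimal sketch under assumed toy dynamics; the table `P`, the uniform policy `pi`, and the helper names are hypothetical. It also checks the identity $V_\pi(s) = \sum_{a} \pi(a|s) Q_\pi(s,a)$ numerically.

```python
# Hypothetical toy MDP; P[(s, a)] lists ((s_next, r), probability) pairs.
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9
P = {
    (0, 0): [((0, 0.0), 1.0)],
    (0, 1): [((1, 1.0), 0.8), ((0, 0.0), 0.2)],
    (1, 0): [((1, 0.5), 1.0)],
    (1, 1): [((0, 0.0), 1.0)],
}
# Uniformly random policy pi(a|s), stored as a nested dictionary.
pi = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}


def q_from_v(V, s, a):
    """One-step lookahead: sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]."""
    return sum(prob * (r + GAMMA * V[s_next]) for (s_next, r), prob in P[(s, a)])


def evaluate_policy(pi, tol=1e-10):
    """Iterate the Bellman expectation equation until the update is below tol."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: sum(pi[s][a] * q_from_v(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new


V = evaluate_policy(pi)
Q = {(s, a): q_from_v(V, s, a) for s in STATES for a in ACTIONS}
for s in STATES:
    # Both printed numbers should agree up to the chosen tolerance.
    print(s, V[s], sum(pi[s][a] * Q[(s, a)] for a in ACTIONS))
```

The iteration converges here because the update is a contraction for $\gamma < 1$; for larger problems the linear system could instead be solved directly.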

The value function allows one to explicitly define and evaluate the quality of policies, i.e., the policy $\pi$ is better than or equal to another policy $\pi'$ iff $V_\pi(s) \geq V_{\pi'}(s)$ for all $s \in \mathcal{S}$. If a policy is better than or equal to all others, it is considered an optimal policy $\pi^*$. All optimal policies share the same optimal state-value function

$$V_{\pi^*}(s) := V^*(s) := \max_{\pi} V_\pi(s), \tag{8}$$

for all $s \in \mathcal{S}$. A similar notion of optimality for the action-value function is given by

$$Q^*(s,a) := \max_{\pi} Q_\pi(s,a), \tag{9}$$

for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. It is straightforward to formulate the connection of both quantities as $V^*(s) = \max_{\pi} \left(\sum_{a \in \mathcal{A}} \pi(a|s) Q_\pi(s,a)\right) = \max_{a \in \mathcal{A}} Q^*(s,a)$. With this it is possible to derive the Bellman optimality equation for the value function as

$$V^*(s) = \max_{a \in \mathcal{A}} \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r | s, a) \left[r + \gamma \cdot V^*(s')\right], \tag{10}$$

for all $s \in \mathcal{S}$. Using the stated connection, this can be reformulated to extend to the state-action value function as $Q^*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r | s, a) \left[r + \gamma \cdot \max_{a' \in \mathcal{A}} Q^*(s',a')\right]$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.
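The reformulation can be spelled out in two steps, using only the quantities defined above. First, $Q^*$ is written as a one-step lookahead on $V^*$; second, $V^*(s')$ is replaced via the connection $V^*(s') = \max_{a'} Q^*(s',a')$:

```latex
\begin{align*}
Q^*(s,a)
  &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r | s, a)
     \left[ r + \gamma \cdot V^*(s') \right] \\
  &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s', r | s, a)
     \left[ r + \gamma \cdot \max_{a' \in \mathcal{A}} Q^*(s', a') \right].
\end{align*}
```

Taking the maximum over $a \in \mathcal{A}$ in the first line recovers the Bellman optimality equation for $V^*$.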

Solving and Approximating the Bellman Equation

One topic that has to be addressed is the actual representation of the policy and value functions. The most intuitive approach is to store the values for all state-action pairs in a table, also referred to as the tabular approach. While this formulation offers convergence and optimality guarantees for several scenarios, it has some serious drawbacks. Most prominently, it becomes intractable once the state-action space gets too large, which is the case for most real-world problems. A workaround is to use parametric function approximators, which results in the parameterized functions $\pi_\theta$, $V_\theta$, or $Q_\theta$, respectively. The typical choice is an NN [HSW89]; in Sec. 4.2 the usage of VQCs for this task is considered from several angles. Since the defining quantities are now approximated, convergence guarantees are also much less straightforward than in the tabular case. The remaining parts of this section apply to both the tabular and the parameterized case, although details might vary.

The Bellman optimality equation offers a tool to derive an optimal policy. It has to be noted that the given formulation makes use of the environment dynamics $p$. Therefore, solution methods solving the equation with dynamic programming are referred to as model-based. The two most prominent examples are value iteration [Bel57] and policy iteration [RN94, PRD96].
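A minimal sketch of value iteration, assuming full access to the dynamics $p$ of a small toy MDP, repeatedly applies the right-hand side of the Bellman optimality equation and then reads off a greedy policy. The table `P` and all function names are hypothetical and serve only as an illustration.

```python
# Hypothetical toy MDP; P[(s, a)] lists ((s_next, r), probability) pairs.
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9
P = {
    (0, 0): [((0, 0.0), 1.0)],
    (0, 1): [((1, 1.0), 0.8), ((0, 0.0), 0.2)],
    (1, 0): [((1, 0.5), 1.0)],
    (1, 1): [((0, 0.0), 1.0)],
}


def backup(V, s, a):
    """Expected one-step return: sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]."""
    return sum(prob * (r + GAMMA * V[s_next]) for (s_next, r), prob in P[(s, a)])


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality update until the values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: max(backup(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new


V_star = value_iteration()
# A greedy policy w.r.t. the converged values is optimal for this toy problem.
greedy = {s: max(ACTIONS, key=lambda a: backup(V_star, s, a)) for s in STATES}
print(V_star, greedy)
```

For this toy problem the fixed point can be verified by hand: staying in state 1 forever yields $0.5 / (1 - 0.9) = 5$, and moving from state 0 yields $4.4 / 0.82 \approx 5.37$.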

There is also a whole range of model-free approaches, where the agent does not make use of any model that represents the environment dynamics. Instead, all information is directly acquired by interaction with the environment. One prominent representative is the $Q$-learning approach [WD92], which is essentially a sample-based approximation of $Q$-value iteration. Starting with a random initialization, the update rule

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left(r_t + \gamma \cdot \max_{a' \in \mathcal{A}} Q(s',a') - Q(s,a)\right) \tag{11}$$

directly derives from the Bellman optimality equation, where $\alpha$ is a learning rate hyperparameter. The policy is usually defined to act epsilon-greedily w.r.t. the current action-value function, i.e.,

$$\pi(s) := \begin{cases} \underset{a \in \mathcal{A}}{\arg\max}\ Q(s,a) & \text{with probability } 1 - \varepsilon, \\ \text{uniformly at random from } \mathcal{A} & \text{with probability } \varepsilon. \end{cases} \tag{12}$$
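The update rule and the epsilon-greedy policy can be combined into a short tabular sketch. The deterministic corridor environment, the hyperparameter values, the zero initialization of the table, and the random tie-breaking are assumptions made for illustration; they are not prescribed by the text above.

```python
import random

# Hypothetical deterministic corridor: states 0..3, state 3 is terminal.
GOAL, LEFT, RIGHT = 3, 0, 1
ACTIONS = [LEFT, RIGHT]
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1


def step(s, a):
    """Move left or right; reaching GOAL yields reward 1 and ends the episode."""
    s_next = min(s + 1, GOAL) if a == RIGHT else max(s - 1, 0)
    return s_next, (1.0 if s_next == GOAL else 0.0), s_next == GOAL


def epsilon_greedy(Q, s, rng):
    """Explore with probability epsilon, otherwise act greedily w.r.t. Q."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(Q[(s, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(s, a)] == best])  # random tie-break


def q_learning(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}  # zero init (assumption)
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = epsilon_greedy(Q, s, rng)
            s_next, r, done = step(s, a)
            # Terminal states have value zero, so the bootstrap term is dropped there.
            target = r if done else r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])  # sample-based update rule
            s = s_next
            if done:
                break
    return Q


Q = q_learning()
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)})
```

Because this toy environment is deterministic, a large learning rate is unproblematic; with stochastic dynamics a smaller or decaying $\alpha$ would typically be chosen.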

An alternative approach is the policy gradient idea [Sut+99], which directly aims to learn the policy. Based on an parameterized policy πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, it performs updates

θθ+αθJ(θ)𝜃𝜃𝛼subscript𝜃𝐽𝜃\theta\leftarrow\theta+\alpha\nabla_{\theta}J(\theta)italic_θ ← italic_θ + italic_α ∇ start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT italic_J ( italic_θ ) (13)

via gradient ascent, where $J(\theta)$ is a performance measure, usually $J(\theta) = V_{\pi_{\theta}}(s_0)$. Unfortunately, the desired gradient generally depends on the environment dynamics, which are not known. The policy gradient theorem [Sut+99] describes a quantity proportional to $\nabla_{\theta} V_{\pi_{\theta}}$, which is easier to obtain. It is given by

$$\nabla_{\theta} V_{\pi_{\theta}}(s_0) \propto \sum_{s \in \mathcal{S}} \mu(s) \sum_{a \in \mathcal{A}} Q_{\pi_{\theta}}(s,a) \nabla_{\theta} \pi_{\theta}(a|s), \tag{14}$$

where $\mu(s)$ is a function that expresses the fraction of time that is spent in state $s$. A concrete instance of this idea is the REINFORCE algorithm [Wil92], where a Monte Carlo method is used to estimate the quantity described in the equation above. Furthermore, the training procedure can be stabilized by introducing a suitable baseline function that reduces the variance of the gradient estimate [Zha+11].
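A minimal sketch of a REINFORCE-style update with a baseline is given below. It relies on the identity $\nabla_{\theta}\pi_{\theta} = \pi_{\theta}\nabla_{\theta}\log\pi_{\theta}$, so that sampling actions from $\pi_{\theta}$ yields a Monte Carlo estimate of the sum in Eq. 14. The single-state task (for which $\mu(s)$ is trivial), the softmax parameterization, the reward values, and the running-average baseline are all assumptions made for illustration only.

```python
import numpy as np

def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    z = np.exp(prefs - np.max(prefs))
    return z / z.sum()

rng = np.random.default_rng(0)
rewards = np.array([0.0, 0.5, 1.0])  # hypothetical deterministic return per action
theta = np.zeros(3)                  # one preference per action (softmax policy)
alpha = 0.2                          # learning rate
baseline = 0.0                       # running average of observed returns

for episode in range(3000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)       # sample an action from the current policy
    G = rewards[a]                   # Monte Carlo return of this one-step episode

    # Gradient of log pi(a) for a softmax policy: one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0

    # Gradient ascent step (Eq. 13) with a variance-reducing baseline.
    theta += alpha * (G - baseline) * grad_log_pi
    baseline += 0.05 * (G - baseline)

final_probs = softmax(theta)
```

The baseline is updated only after it has been used, so it does not depend on the action of the current episode and leaves the gradient estimate unbiased.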

Overall, there are several extensions and modifications of the described concepts. One method worth mentioning is the actor-critic approach [KT03], which combines ideas from policy gradient and value functions. As for smaller modifications, there is double $Q$-learning, which introduces an additional target action-value function to reduce some bias caused by the standard $Q$-learning procedure [Has10]. Similarly, the introduction of an experience replay buffer [Lin92] should improve stability and sample efficiency. This finally leads to offline or batch RL [EGW05], where the agent is not allowed to directly interact with the environment. Instead, it only has access to a set of previously collected experiences. This formulation is especially relevant in practice, as generating data is sometimes quite expensive. There is still a wide range of topics this summary did not touch. Where necessary, additional details will also be introduced in the upcoming chapters. For a broader introduction to the topic one can refer to Ref. [SB18]; more recent developments are reviewed, e.g., in Refs. [Aru+17, NLH20].

3 The Quantum Computing Paradigm

The foundations of QC were established at the beginning of the 20th century when the modern theory of quantum physics was developed. Benioff and Feynman proposed the idea of taking advantage of quantum mechanical systems for computing in the early 1980s [Ben80, Fey82]. QC challenges the strong Church-Turing hypothesis, as it potentially provides efficient solutions to classically intractable problems [NL16]. This section gives a pragmatic introduction to the basics of QC, and also provides an extension to quantum machine learning (QML) (here understood as ML with VQCs as a new class of models) with a focus on QRL.

Single and Multi-Qubit Systems

Similar to RL, notation and conventions regarding quantum computing vary quite a bit throughout the literature. Regarding notation, we closely follow the textbook by Nielsen and Chuang [NL16].

For the moment, let us consider the basic unit of information for classical information processing. A single bit is either in state $0$ or state $1$; consequently, a sequence of $n$ bits can represent $2^n$ unique values. Obviously, the bit register can only be in one of these $2^n$ states at any point in time.

A qubit is the quantum version of a bit. We use the Dirac notation [NL16] to define $\ket{0}$ and $\ket{1}$ as two distinct, orthogonal states of the qubit system. These basis states span a $2$-dimensional Hilbert space $\mathcal{H} \cong \mathbb{C}^2$, which contains all $1$-qubit (pure) quantum states. The qubits are subject to the laws of quantum mechanics and can be realized with, e.g., spin systems of subatomic particles [PMV02], ion traps [BBA14], neutral atoms [SWM10], or superconducting circuits [YN06]. This gives rise to some interesting properties. In fact, a qubit can not only be in either state $\ket{0}$ or $\ket{1}$, but in a superposition of both. An arbitrary $1$-qubit state is given as

$$\ket{\psi} = \alpha \ket{0} + \beta \ket{1}. \tag{15}$$
Figure 3: Bloch sphere representation of a $1$-qubit state.

The amplitudes $\alpha$ and $\beta$ are complex numbers, which must satisfy $|\alpha|^2 + |\beta|^2 = 1$. To get a nice visual representation, Eq. 15 can be reformulated as

$$\ket{\psi} = e^{i\gamma} \left( \cos\frac{\theta}{2} \ket{0} + e^{i\phi} \sin\frac{\theta}{2} \ket{1} \right), \tag{16}$$

with $\gamma, \theta, \phi \in \mathbb{R}$. As any global phase has no observable effect [NL16], the prefactor $e^{i\gamma}$ in Eq. 16 can be omitted. This representation makes it possible to visualize the state of a $1$-qubit system on the surface of the Bloch sphere, see Fig. 3. The north and south poles w.r.t. the $z$-axis correspond to the basis states $\ket{0}$ and $\ket{1}$, which are also referred to as computational basis states of a single qubit. Another, less commonly used basis is given by the poles w.r.t. the $x$-axis; its elements are $\ket{+} = \frac{\ket{0} + \ket{1}}{\sqrt{2}}$ and $\ket{-} = \frac{\ket{0} - \ket{1}}{\sqrt{2}}$. Similarly, one could also use $\ket{R} = \frac{\ket{0} + i\ket{1}}{\sqrt{2}}$ and $\ket{L} = \frac{\ket{0} - i\ket{1}}{\sqrt{2}}$. An alternative representation associates quantum states with amplitude vectors:

$$\ket{0} \to \begin{bmatrix} 1 \\ 0 \end{bmatrix} \text{ and } \ket{1} \to \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{21}$$
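The amplitude-vector picture can be connected to the Bloch sphere with a short numerical sketch. The helper names `bloch_state` and `bloch_vector` are hypothetical; the Cartesian coordinates are computed with the standard relations $x = 2\,\mathrm{Re}(\alpha^{*}\beta)$, $y = 2\,\mathrm{Im}(\alpha^{*}\beta)$, $z = |\alpha|^2 - |\beta|^2$, which are not stated explicitly in the text above.

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)  # amplitude vector of |0>
ket1 = np.array([0, 1], dtype=complex)  # amplitude vector of |1>

def bloch_state(theta, phi):
    """Eq. 16 with the unobservable global phase dropped."""
    return np.cos(theta / 2) * ket0 + np.exp(1j * phi) * np.sin(theta / 2) * ket1

def bloch_vector(psi):
    """Cartesian coordinates (x, y, z) of a pure 1-qubit state on the Bloch sphere."""
    alpha, beta = psi
    x = 2 * np.real(np.conj(alpha) * beta)
    y = 2 * np.imag(np.conj(alpha) * beta)
    z = np.abs(alpha) ** 2 - np.abs(beta) ** 2
    return np.array([x, y, z])

plus = bloch_state(np.pi / 2, 0.0)         # |+>, pole of the x-axis
right = bloch_state(np.pi / 2, np.pi / 2)  # |R>, pole of the y-axis
north = bloch_state(0.0, 0.0)              # |0>, north pole of the z-axis
```

Evaluating `bloch_vector` on these states places them on the positive $x$-, $y$-, and $z$-axis respectively, in line with the basis states listed above.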

Multiple-qubit systems are the point where things get interesting. An $n$-qubit system gives access to the $2^n$-dimensional Hilbert space, in which an arbitrary pure quantum state is defined as

$$\ket{\psi} = c_0 \ket{00\cdots00} + c_1 \ket{00\cdots01} + \cdots + c_{2^n-1} \ket{11\cdots11}, \tag{22}$$

with $c_i \in \mathbb{C}$ and $\sum_{i=0}^{2^n-1} |c_i|^2 = 1$. The basis states, e.g. $\ket{00\cdots01} = \ket{0} \otimes \ket{0} \otimes \cdots \otimes \ket{0} \otimes \ket{1}$, consist of tensor products of the individual qubits. The state $\ket{\psi} \to \left[c_0, c_1, \cdots, c_{N-1}\right]^t$ possesses $N = 2^n$ complex amplitudes, whose absolute squared values must sum up to one. Due to the principle of superposition, an $n$-qubit system is able to encode and process information scaling in $\mathcal{O}\left(2^n\right)$, while for a classical setting, it is limited to $\mathcal{O}\left(n\right)$.
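The tensor-product structure of the basis states translates directly into Kronecker products of amplitude vectors. The sketch below assumes the convention that the leftmost qubit is the most significant one; the function name `basis_state` is hypothetical.

```python
from functools import reduce

import numpy as np

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

def basis_state(bits):
    """Amplitude vector of |b_1 b_2 ... b_n> as a Kronecker product of 1-qubit states."""
    return reduce(np.kron, [ket1 if b else ket0 for b in bits])

psi = basis_state([0, 0, 1])         # |001>, a vector with 2**3 = 8 amplitudes
index = int(np.argmax(np.abs(psi)))  # position of the single non-zero amplitude

# Equal superposition of all 2**n basis states, normalized as required by Eq. 22.
n = 3
uniform = np.full(2 ** n, 1 / np.sqrt(2 ** n), dtype=complex)
norm = float(np.sum(np.abs(uniform) ** 2))
```

The bit string read as a binary number gives the index of the non-zero amplitude, which is why $\ket{00\cdots01}$ is paired with $c_1$ in Eq. 22.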

Evolution of Closed Quantum Systems

In order for computation to be possible, there must be some method to manipulate quantum states. Exactly this is achieved by operators acting on the Hilbert space $\mathcal{H}$. By definition, all operators that describe the time evolution of a closed quantum system are reversible. Hence, they can be represented as unitary matrices, i.e., for an operator $U$ it must hold that $U^{\dagger}U = I$. This constraint also implies length-preserving properties, i.e., applying a unitary operator to a quantum state will again yield a valid quantum state satisfying Eq. 22.

In the following, explicit matrix representations of operators are specified in the computational basis. Starting simple, consider the bit-flip operator $\sigma_x$. This operator just flips the amplitudes of the $\ket{0}$ and $\ket{1}$ basis states; on the Bloch sphere this is equivalent to a rotation by $\pi$ about the $x$-axis. The corresponding operators also exist for the $y$-axis and $z$-axis. In matrix notation those are given as

$$X := \sigma_x = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad Y := \sigma_y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}, \quad Z := \sigma_z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}. \tag{29}$$
Figure 4: Circuit symbols of various quantum operators (gates).

Allowing an additional degree of freedom, one can define an operator for a rotation by an arbitrary angle $\theta$ about axis $i$ as

$$R_i(\theta) = e^{-i\frac{\theta}{2}\sigma_i}, \quad \text{for } i \in \{x,y,z\}. \tag{30}$$
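Since every Pauli matrix squares to the identity, the exponential in Eq. 30 has the closed form $R_i(\theta) = \cos\frac{\theta}{2}\, I - i \sin\frac{\theta}{2}\, \sigma_i$. The following sketch uses this identity to build the rotation gates and to check unitarity numerically; the helper names `rotation` and `is_unitary` are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = {"x": X, "y": Y, "z": Z}

def rotation(axis, theta):
    """Eq. 30 via exp(-i theta/2 sigma) = cos(theta/2) I - i sin(theta/2) sigma."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * PAULIS[axis]

def is_unitary(U):
    """Check U^dagger U = I up to floating-point tolerance."""
    return bool(np.allclose(U.conj().T @ U, np.eye(U.shape[0])))

ket0 = np.array([1, 0], dtype=complex)

# A rotation by pi about the x-axis maps |0> to -i|1>,
# which equals the bit-flipped state |1> up to a global phase.
flipped = rotation("x", np.pi) @ ket0
```

The factor $-i$ in `flipped` is a global phase and therefore has no observable effect, which is why $R_x(\pi)$ and $\sigma_x$ describe the same operation on the Bloch sphere.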

The last $1$-qubit operator we introduce is the Hadamard matrix:

$$H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \tag{33}$$

which basically performs a change of basis with $H\ket{0} = \ket{+}$ and $H\ket{1} = \ket{-}$. By employing the tensor product for operators, we can extend $1$-qubit operators to act on single qubits comprising a multi-qubit system. We now move to genuine multi-qubit operators, acting non-trivially on two or more qubits. For our purposes, the most relevant $2$-qubit operators are the controlled $X$ ($CX$) and controlled $Z$ ($CZ$), where one qubit acts as the control and the other as the target. More concretely, the $CX$-gate flips the amplitudes of the target qubit iff the control is in state $\ket{1}$. Similar to this, the $CZ$ operator performs a conditional phase flip. The matrix notations are given by

$$CX = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \quad \text{and} \quad CZ = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}. \tag{42}$$

Quantum circuit diagrams are a nice way to visualize what is going on in a quantum algorithm. The individual qubits are represented as wires, where the order of operators, also called gates, is defined by their relative position. To be more precise, the top wire gets associated with the leftmost qubit. A few common circuit symbols for the operators introduced so far are depicted in Fig. 4.
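As a small worked example of composing gates, the sketch below applies a Hadamard gate to the first (leftmost) qubit and then a $CX$ gate with that qubit as control. Starting from $\ket{00}$, this is expected to produce the entangled state $(\ket{00} + \ket{11})/\sqrt{2}$. The choice of this particular circuit is an illustration and is not taken from the text.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
CX = np.array([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 0, 1],
               [0, 0, 1, 0]], dtype=complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

ket00 = np.zeros(4, dtype=complex)
ket00[0] = 1.0

# Gates are applied right to left: first H on the leftmost qubit, then CX.
# np.kron(H, I2) extends the 1-qubit gate to the 2-qubit system.
bell = CX @ np.kron(H, I2) @ ket00

# CZ can be obtained from CX by conjugating the target qubit with Hadamards,
# because H X H = Z.
cz_from_cx = np.kron(I2, H) @ CX @ np.kron(I2, H)
```

The resulting state cannot be written as a tensor product of two $1$-qubit states, which is what distinguishes genuine multi-qubit operators from tensor products of $1$-qubit gates.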

Extracting Classical Information via Measurements

In classical computing, it is trivial to observe the exact states of all bits. For quantum systems, in order to extract information, an observable quantity has to be measured. To build the bridge to quantum computing, for each physical observable there exists a Hermitian operator $O$ [NL16], i.e., it holds $O^{\dagger} = O$. The eigenstates of $O$ define a basis of the quantum system’s Hilbert space.

Once an observable $O$ is measured, the corresponding measurement device outputs an eigenvalue of $O$. The post-measurement state of the system is given by the eigenstate corresponding to the eigenvalue that is measured. The most commonly used observable might be Pauli-$Z$, which corresponds to a measurement in the computational basis for a single qubit, see also Eq. 29. It has eigenvalues $\lambda_1 = +1$, $\lambda_2 = -1$ and corresponding eigenstates $v_1 = \left[1 \;\; 0\right]^t$, $v_2 = \left[0 \;\; 1\right]^t$.

The consequences for quantum computing are quite sobering, as observing superpositions w.r.t. the basis defined by the observable is impossible. Rather, one of the postulates of quantum mechanics states the Born rule, which defines a probabilistic relationship between quantum state and measurement output. Let $\ket{0}, \ket{1}, \ldots, \ket{N-1}$ be the basis defined by observable $O$ and $c_0, c_1, \ldots, c_{N-1}$ the corresponding amplitudes of state $\ket{\psi}$ expressed in this basis. Measuring $O$ will then result in the measurement outcome $\lambda_i$ with probability $|c_i|^2$. Consequently, having obtained $\lambda_i$, the post-measurement state of the system is $\ket{i}$.
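The Born rule can be mimicked classically by sampling an index $i$ with probability $|c_i|^2$ and replacing the state by the corresponding basis state. The sketch below does this for a measurement in the computational basis; the function name `measure`, the example state, and the number of repetitions are assumptions made for illustration.

```python
import numpy as np

def measure(psi, rng):
    """Computational-basis measurement: returns (outcome index, post-measurement state)."""
    probs = np.abs(psi) ** 2
    probs = probs / probs.sum()  # guard against floating-point rounding
    i = int(rng.choice(len(psi), p=probs))
    post = np.zeros_like(psi)
    post[i] = 1.0                # the state collapses to the basis state |i>
    return i, post

rng = np.random.default_rng(0)

# Hypothetical 2-qubit state sqrt(0.8)|00> + sqrt(0.2)|11>.
psi = np.array([np.sqrt(0.8), 0, 0, np.sqrt(0.2)], dtype=complex)

# Each repetition starts from a freshly prepared copy of psi.
outcomes = [measure(psi, rng)[0] for _ in range(10_000)]
frequencies = np.bincount(outcomes, minlength=4) / len(outcomes)
```

The relative frequencies approach $|c_i|^2$ only in the limit of many repetitions, which is why the amplitudes themselves are not directly observable from a single run.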

The first algorithm claiming provable quantum advantage, i.e., an improvement w.r.t. some complexity metric compared to any classical approach, was published in 1992 by Deutsch and Jozsa [DJ92] for a specially constructed problem. Most famous might be Shor’s algorithm [Sho97], which provides an exponential speedup over the best known classical algorithms for tasks like prime factorization. Unfortunately, it requires large-scale, fault-tolerant and error-corrected quantum computers. All current hardware can be considered NISQ devices, which makes the execution of these algorithms infeasible. Despite this, the first claim of experimental quantum advantage was published in 2019 [Aru+19]. Yet, the considered problem was quite far from general practical applicability. A demonstration of quantum supremacy on a practically relevant problem has still to be given. There are some promising candidates like quantum chemistry and material science. Recently, ideas have been put forward on combining quantum computing and machine learning [Ben+20, SP18]. These algorithms are expected to bypass at least some of the problems with execution on presently available NISQ hardware.

Quantum Machine Learning with Variational Quantum Circuits
Figure 5: Variational quantum circuit consisting of feature map, variational layer, and measurement.

Research on QML has only really taken off in the last two decades, yet a variety of approaches already exists. As a rough guide, the hoped-for benefit of QML relies, to a large extent, on the access to the high dimensional Hilbert space granted by quantum systems. Here, we want to briefly collect the background for the summaries of VQC-based QRL approaches in Sec. 4.2.

QML frequently deals with expectation values of quantum measurements. The expectation value of an observable $O$ w.r.t. the quantum state $\ket{\psi}$ is denoted as

$$\langle O \rangle_{\psi} := \langle \psi | O | \psi \rangle. \tag{43}$$
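For a single qubit and $O = Z$, Eq. 43 evaluates to $|\alpha|^2 - |\beta|^2$, i.e., the average of the eigenvalues $\pm 1$ weighted by their Born-rule probabilities. The sketch below compares the exact value with a finite-shot estimate, as one would obtain from repeated measurements; the example state and the shot count are hypothetical.

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)

def expectation(op, psi):
    """Eq. 43: <psi|O|psi>, which is real for a Hermitian operator O."""
    return float(np.real(np.vdot(psi, op @ psi)))

# Hypothetical state sqrt(0.8)|0> + sqrt(0.2)|1>.
psi = np.array([np.sqrt(0.8), np.sqrt(0.2)], dtype=complex)
exact = expectation(Z, psi)  # |alpha|^2 - |beta|^2

# Finite-shot estimate: sample eigenvalues +1 / -1 according to the Born rule
# and average them.
rng = np.random.default_rng(0)
shots = rng.choice([+1, -1], size=5000, p=np.abs(psi) ** 2)
estimate = float(shots.mean())
```

The statistical error of `estimate` shrinks with the inverse square root of the number of shots, which is one reason why expectation values on hardware are more costly than in exact simulation.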

While VQCs define a new class of ML models, one can make the case for a loose analogy to NNs, where the relation between input and output depends on a set of weights. An example of a parameterized quantum operator is given in Eq. 30. The corresponding gate applies a rotation about a specific axis by some angle $\theta_0$. Multiple rotation gates form a quantum circuit, where $\boldsymbol{\theta}$ summarizes all free parameters. Varying these values makes it possible to steer the evolution of the quantum system. Let $U_{\theta}$ denote the corresponding unitary. A schematic example of a VQC is displayed in Fig. 5. Most RL tasks use the concept of states, based on which an informed decision should be taken. This state information is encoded into the quantum system with an appropriate feature map. In general, the inputs $s$ are pre-processed with some mapping function $\Phi$. The results $\Phi(s)$ can be neatly integrated into the quantum circuit via the unitary $U_{\Phi(s)}$. To enhance the expressive power of the VQC, one can use more sophisticated data encoding routines like data re-uploading [Pér+20] or incremental data-uploading [Per+22]. Eventually, some observable has to be measured. A common choice is the computational basis with $O = Z^{\otimes n}$. Overall, the output of the VQC-model can be described as

$$\begin{aligned} \langle O \rangle_{s,\theta} &= \bra{0} \left( U_{\theta} U_{\Phi(s)} \right)^{\dagger} O \, U_{\theta} U_{\Phi(s)} \ket{0} \\ &:= \bra{0} U_{s,\theta}^{\dagger} \, O \, U_{s,\theta} \ket{0}. \end{aligned} \tag{44}$$
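The structure of Eq. 44 can be made concrete with a small state-vector simulation of a $2$-qubit VQC. Everything specific in the sketch is an assumption chosen for illustration: the feature map uses $\Phi(s) = \arctan(s)$ as rotation angles of $R_x$ gates, the variational layer consists of one $R_y$ gate per qubit followed by a $CZ$ gate, and the observable is $Z \otimes Z$. The surveyed works use a variety of other circuit designs.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

def rot(pauli, angle):
    """Rotation gate exp(-i angle/2 pauli), cf. Eq. 30."""
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * pauli

def feature_map(s):
    """U_Phi(s): hypothetical angle encoding with Phi(s) = arctan(s), one RX per qubit."""
    phi = np.arctan(s)
    return np.kron(rot(X, phi[0]), rot(X, phi[1]))

def variational_layer(theta):
    """U_theta: one trainable RY per qubit, followed by an entangling CZ."""
    return CZ @ np.kron(rot(Y, theta[0]), rot(Y, theta[1]))

def vqc_output(s, theta, observable):
    """Eq. 44: expectation value of the observable in the state U_theta U_Phi(s) |00>."""
    ket00 = np.zeros(4, dtype=complex)
    ket00[0] = 1.0
    psi = variational_layer(theta) @ feature_map(s) @ ket00
    return float(np.real(np.vdot(psi, observable @ psi)))

O = np.kron(Z, Z)
value = vqc_output(np.array([0.3, -1.2]), np.array([0.5, 0.1]), O)
```

Because the eigenvalues of $Z \otimes Z$ are $\pm 1$, the raw output always lies in $[-1, 1]$, which is one reason a post-processing function is typically applied before the value is used, e.g., as a $Q$-value.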

For most tasks, this value is post-processed using some function $f$. Keeping things as general as possible, one can define a loss function $\mathcal{L}$ on $f\left(\langle O \rangle_{s,\theta}\right)$ (based on the concrete problem at hand). The update of the parameters can be performed using, e.g., gradient-based techniques:

$$\theta \leftarrow \theta - \alpha \cdot \nabla_{\theta} \mathcal{L}\left( f(\langle O \rangle_{s,\theta}) \right) \tag{45}$$

The required gradient can be obtained using the parameter-shift rule [Cro19, Wie+22a], or SPSA-based approximations [Wie+23].
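For gates of the form $e^{-i\frac{\theta}{2}\sigma}$, the parameter-shift rule obtains the exact derivative of an expectation value from two circuit evaluations at shifted parameters, $\partial_{\theta}\langle O \rangle = \frac{1}{2}\left(\langle O \rangle_{\theta + \pi/2} - \langle O \rangle_{\theta - \pi/2}\right)$. The sketch below applies it to a hypothetical $1$-qubit model with an $R_x$ data encoding and one trainable $R_y$ gate, for which $\langle Z \rangle = \cos(s)\cos(\theta)$. The squared-error loss, the target value, and the hyperparameters are assumptions; since the objective is a loss to be minimized, the step is taken against the gradient.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(pauli, angle):
    """Rotation gate exp(-i angle/2 pauli), cf. Eq. 30."""
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * pauli

def model(s, theta):
    """<Z> after applying RX(s) (data encoding) and RY(theta) (trainable) to |0>."""
    psi = rot(Y, theta) @ rot(X, s) @ np.array([1, 0], dtype=complex)
    return float(np.real(np.vdot(psi, Z @ psi)))

def parameter_shift_grad(s, theta):
    """Derivative of model w.r.t. theta from two shifted circuit evaluations."""
    return 0.5 * (model(s, theta + np.pi / 2) - model(s, theta - np.pi / 2))

s, target = 0.4, -0.5    # hypothetical input and regression target
theta, alpha = 0.1, 0.2  # initial parameter and learning rate

for _ in range(200):
    residual = model(s, theta) - target
    # Chain rule for the loss (model - target)^2.
    grad_loss = 2 * residual * parameter_shift_grad(s, theta)
    theta -= alpha * grad_loss  # gradient descent on the loss
```

Unlike a finite-difference quotient, the shift of $\pi/2$ is not small, so the two evaluations differ by a macroscopic amount and the estimate is less sensitive to shot noise.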

4 Quantum Reinforcement Learning Algorithms

In QML, there are approaches that either aim to stabilize the coherent function of the QPU using ML methods, or use the structure of a hybrid variational algorithm for ML purposes. Very often, RL is used to generate a solution for a quantum control problem, e.g., to learn quantum error correction strategies [Fös+18] or to generate control policies at a lower level [Zha+19, Dal+20]. Other work considers RL as the optimizer of a variational quantum algorithm (VQA) [Kha+19, Kha+20]. While this represents a fascinating research topic in itself, here we will focus on the application of QRL algorithms for solving specific tasks, be they classical or quantum. Research in the field of QML has so far mostly focused on supervised and unsupervised learning. However, the literature already proposes quite a few theoretical concepts and even some small-scale experimental realizations for QRL. Recent developments mostly focus on employing VQCs as function approximators. When transferring from RL to QRL, i.e., the ‘quantization’ of the RL paradigm, there are various possibilities for how quantum computing enters the game. This has led to the development of different QRL variants. A few works exist that review current progress in QRL [KSG21, ML22, Kun22, Lam23, NHP23] and the more general correspondence of RL and QC [ML21]. There is also recent work towards a fair comparison of RL and QRL in restricted settings [MK21, Fra+22].

Quantum-Inspired Approaches. The earliest idea for combining RL with a quantum routine relies on the method of amplitude amplification, as it is used in Grover-type search algorithms [CDC06, Don+08a, Don+06a, Che+06, Don+06, CD08, Don+08, CD10, CFD12, Fak+13, NGC15, Li+20, Nir+21, LAD21, Yin+21, Hu+21, Ren+22, Cho+23]. Several qubit registers embed the states and actions relevant for the RL system in a suitable Hilbert space. Starting from a uniform superposition, amplitudes favored by the reward or the value function are selectively amplified. The action selection is based on Born’s rule, i.e., a measurement is carried out on the qubit register with regard to the ‘action-basis’. The algorithm was also investigated independently of QPUs [Don+12] and recently further developed [GH19]. An introduction to this concept is also provided in Ref. [Raj+21]. As it turns out, these early variants should rather be considered a set of QiRL algorithms that do not offer an intrinsic potential for quantum advantage. Recently, the technique was transferred to sampling from the experience replay buffer in $Q$-learning [Wei+21]. A summary and review of this type of QiRL can be found in Sec. 4.1.
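The amplification step described above can be illustrated with a classical simulation of one Grover iteration on an action register. This is only a sketch under strong assumptions: there are four actions, a single action is marked as favored by hand, and exactly one iteration is applied, whereas the cited works couple the amount of amplification to the reward or value function. The function name `grover_iteration` is hypothetical.

```python
import numpy as np

def grover_iteration(amps, good):
    """One amplitude-amplification step on a real amplitude vector."""
    amps = amps.copy()
    amps[good] *= -1               # oracle: flip the sign of the favored actions
    return 2 * amps.mean() - amps  # diffusion: reflect about the uniform superposition

n_actions = 4
amps = np.full(n_actions, 1 / np.sqrt(n_actions))  # uniform superposition over actions
good = np.array([False, False, True, False])       # hypothetical: action 2 is favored

amps = grover_iteration(amps, good)

# Action selection via Born's rule: measure the register in the 'action-basis'.
probs = amps ** 2
probs = probs / probs.sum()
rng = np.random.default_rng(0)
action = int(rng.choice(n_actions, p=probs))
```

For four actions with one marked entry, a single iteration already concentrates all probability on the favored action; for larger registers, several iterations are needed and too many iterations reduce the probability again.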

VQC-Based Function Approximation. In DRL, deep neural networks (DNNs) are employed as powerful function approximators. Typically, the approximation either happens in policy space (actor), in value space (critic), or both, resulting in so-called actor-critic approaches. Recently, VQCs were proposed and analyzed in their role as function approximators in the RL setting – an extensive overview is provided in Sec. 4.2. On the one hand, this approach basically replaces a more or less well understood heuristic with a poorly understood heuristic. For the quantum heuristic many open questions regarding computational power, scalability and trainability remain. On the other hand, VQCs nonetheless have spurred the hope for quantum advantage already with NISQ devices. The earliest work in this direction proposed VQC-based approximation in value space, which is covered in Sec. 4.2.1. This so-called VQC-based $Q$-learning was introduced in Ref. [Che+20], and extended in Refs. [LS20, LS21, Lok+22, Che23b, CCC23, FP+23, SJD22, Sko+23, LXJ23]. A method to efficiently evaluate the $Q$-function is discussed in Ref. [San+23], which is however not entirely NISQ-feasible. The complementary approach of approximation in policy space is discussed in Sec. 4.2.2. Originally proposed in Ref. [Jer+21], several extensions have been discussed in Refs. [Kun22, BAQ23, SSB23, Jer+23, Mey21, Mey+23a, Mey+23]. Combinations of value and policy approximation are covered in Sec. 4.2.3, with (soft) actor-critic approaches in Refs. [Wu+23, Kwa+21, Ree23, Che23, Lan21], and multi-agent formulations in Refs. [Yun+22, YPK23]. The setting of offline quantum reinforcement learning is considered in Sec. 4.2.4 by Refs. [Per+23, Che+23]. A collection of algorithmic and conceptual extensions that are relevant for a wide range of approaches is composed in Sec. 4.2.5, based on Refs. [Che23c, Che23a, Kim+21, Hsi+22, Dră+22, Kru+23, SMT23, ACN23, PPR20, Che+22, DS23, Köl+23]. A collection of application-focused work is summarized in Sec. 4.2.6, comprising Refs. [Acu+22, Hei+22, Cob23, BYK22, SMK23, Hic+23, KCP23, Cor+23, San+22, ACN22, Liu+23, Kum+23, Rai+23, SH23, RKM22, Yan+22, Par+23, NS+23, Par+23a, PK23, Yun+23, Ans+23, Che+23b, Yan23, CRC23, Che23d].

Projective Simulation. Another QRL method is based on projective simulation (PS), which in the broadest sense is a particular learning paradigm and similar in spirit to RL [BD12]. Based on experiences made through interaction with the environment, a memory network is created by the agent. The network has a directed structure with adaptive weights between the nodes of the network. The learning process and action selection are based on a random process (more precisely, a random walk) on the graph of the network, with the transition probabilities between nodes being given by the respective adaptive weights. PS can be ‘quantized’ by replacing the random walk with a so-called quantum random walk [Pap+14, Tei21, TRC21, Mel+17]. A formal analysis of convergence properties was given in Ref. [Boy+20]. In fact, there is already work on a proof-of-principle implementation in the laboratory [DFB15, Sri+18] and proposals for quantum-optics implementations [Fla+23]. Possible quantum advantages over classical PS lie in the acceleration of the process of action selection, also referred to as deliberation in the literature. A more detailed summary is provided in Sec. 4.3.
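The classical random-walk deliberation described above can be sketched for the simplest case of a two-layer memory network, in which percept nodes connect directly to action nodes. This is only an illustrative sketch: the class name `TwoLayerPSAgent`, the forgetting rate `gamma`, and the additive reward update are assumptions in the spirit of Ref. [BD12], not a reproduction of any specific implementation.

```python
import random


class TwoLayerPSAgent:
    """Minimal projective-simulation sketch: percept nodes hop directly to action nodes."""

    def __init__(self, num_percepts, num_actions, gamma=0.0, seed=None):
        # Adaptive edge weights of the memory network; all edges start at 1.
        self.h = [[1.0] * num_actions for _ in range(num_percepts)]
        self.gamma = gamma  # assumed forgetting rate that pulls weights back to 1
        self.rng = random.Random(seed)

    def hop_probabilities(self, percept):
        # Transition probabilities are the normalized adaptive weights.
        total = sum(self.h[percept])
        return [w / total for w in self.h[percept]]

    def act(self, percept):
        # One step of the random walk: percept node -> action node.
        probs = self.hop_probabilities(percept)
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def learn(self, percept, action, reward):
        # Damp every edge toward 1, then reinforce the edge that was used.
        for row in self.h:
            for a in range(len(row)):
                row[a] -= self.gamma * (row[a] - 1.0)
        self.h[percept][action] += reward
```

In this sketch only `act` performs the deliberation step; it is this step that the quantized variants replace with a quantum random walk.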

Quantum Boltzmann Machines. Another line of research proposes to use Boltzmann machines as function approximators. These models are assumed to be advantageous compared to typical NNs in environments with large action spaces. Ref. [Jer+21a] demonstrates that Boltzmann machines are closely related to energy-based models. For specific instances, those allow for a quantum representation, which enables a potential quantum speed-up on post-NISQ devices. A similar concept is also proposed for the annealing-based QC paradigm [Cra+18, Sch+22, Lev+17]. A summary of these ideas can be found in Sec. 4.4.

Quantum Subroutines. Another approach to go from RL to QRL replaces certain subroutines in existing RL approaches. One idea is to replace policy or value iteration with some quantum-enhanced analogues. While this approach is limited to universal, fault-tolerant and error-corrected quantum hardware, several such algorithms have been proposed and analyzed [Wie21, Wie+22, Wan+21, CKP23, Gan+23, Zho+23, GA23]. Most importantly, these algorithms come with guarantees regarding speed-up, compared to their classical counterparts. QRL in these settings is often limited to the tabular case and assumes a quantum version of the RL environment, i.e., oracle access. Our summaries and reviews can be found in Sec. 4.5.

Full-Quantum Formulation. An approach which not only ‘quantizes’ certain subroutines, but all components of the pipeline, is considered in Refs. [DTB15, DTB16, DTB17, Dun+18]. Extensions [HDW21, HW22], applications to specific problems [Wan+21a, Wan+23], and small-scale experimental realizations [Sag+21] were presented. An alternative route to fully quantized QRL was taken in [Cor18]. For our review of this line of research, see Sec. 4.6.

Various Concepts. For the sake of completeness, we mention different approaches found in the literature. We note, however, that we did not pursue a detailed review of those works, typically because we focused on what we identified as the most considered lines of research. While some of the works listed in the following simply do not fit directly with the learning-based QRL approach, for others it might not seem obvious how to generalize their particular setting to a broader class of problems. While quantum algorithms for dynamic programming have been discussed [Ron19, Amb+19], it currently remains unclear how to move from dynamic programming to a learning-based approach such as RL. Similarly, quantum algorithms have been employed to solve planning tasks [NW05], but again the transfer to a learning-based approach is far from obvious. More closely related to the typical RL setting is the task of imitation learning [Che+23a]. A series of papers discussed QRL in the setting of photonic circuits, see Refs. [HH19, HH19b, HH19a, HH19c, SH20] and Refs. [Fla+20, Lam21, Sag+21a, Nag+21, Shi+22], with the connection to superconducting qubits established in Refs. [Lam17, Cár+18]. Another approach, which we did not review in detail, is given by combining RL with the paradigm of quantum annealing [Neu+17, AHF20, Neu+20, Mül+21, FH23, NY23]. Strategies have been developed to address the classical and quantum version of contextual bandits [LHT22, LJW22, BLT23, BKS23]. Furthermore, a quantum version of the classical RL benchmark environment CartPole has been formulated [WAU20, Mei+23]. Similarly, various interpretations of QRL for specialized tasks in the quantum domain exist [Alv+16, Alv+18, Bha+19, Alb+18, Alb+20, She+20, Oli+20, Liu+22, ÇY23]. Different approaches have been proposed for combining RL with quantum walks [Che+19, Dal+22, MVB22]. Further work on optimization tasks rather than RL, such as Refs. [Ram17, Jaš+19, Bel+20], has not been reviewed in detail. An interesting interpretation of self-learning physical machines is discussed in [LM23], which potentially could be brought into line with QRL.

4.1 Quantum-Inspired Reinforcement Learning based on Amplitude Amplification

Citation | First Author | Title
[Don+08a] | D. Dong | Quantum reinforcement learning
[Don+06] | D. Dong | Quantum mechanics helps in learning for more intelligent robots
[CDC06] | C.-L. Chen | Quantum computation for action selection using reinforcement learning
[Don+06a] | D. Dong | Quantum Robot: Structure, Algorithms and Applications
[Che+06] | C.-L. Chen | Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning
[CD08] | C.-L. Chen | A Quantum Reinforcement Learning Method for Repeated Game Theory
[Don+08] | D. Dong | Incoherent Control of Quantum Systems With Wavefunction-Controllable Subspaces via Quantum Reinforcement Learning
[CD10] | C.-L. Chen | Complexity analysis of Quantum reinforcement learning
[Don+12] | D. Dong | Robust Quantum-Inspired Reinforcement Learning for Robot Navigation
[CFD12] | C. Chunlin | Hybrid control of uncertain quantum systems via fuzzy estimation and quantum reinforcement learning
[Fak+13] | P. Fakhari | Quantum inspired reinforcement learning in changing environment
[NGC15] | S. Nuuman | A quantum inspired reinforcement learning technique for beyond next generation wireless networks

Table 2: [Part 1] Work considered for “QiRL based on Amplitude Amplification” (Sec. 4.1)

Citation | First Author | Title
[GH19] | M. Ganger | Quantum Multiple Q-Learning
[Li+20] | J.-A. Li | Quantum reinforcement learning during human decision-making
[LAD21] | Y. Li | Intelligent Trajectory Planning in UAV-Mounted Wireless Networks: A Quantum-Inspired Reinforcement Learning Perspective
[Raj+21] | K. Rajagopal | Quantum Amplitude Amplification for Reinforcement Learning
[Nir+21] | D. Niraula | Quantum deep reinforcement learning for clinical decision support in oncology: application to adaptive radiotherapy
[Wei+21] | Q. Wei | Deep Reinforcement Learning With Quantum-Inspired Experience Replay
[Yin+21] | L. Yin | Quantum deep reinforcement learning for rotor side converter control of double-fed induction generator-based wind turbines
[Hu+21] | Y. Hu | Quantum-enhanced reinforcement learning for control: a preliminary study
[Ren+22] | Y. Ren | NFT-Based Intelligence Networking for Connected and Autonomous Vehicles: A Quantum Reinforcement Learning Approach
[Cho+23] | B. Cho | Quantum bandit with amplitude amplification exploration in an adversarial environment

Table 3: [Part 2] Work considered for “QiRL based on Amplitude Amplification” (Sec. 4.1)

Quantum reinforcement learning, Dong et al. (2008) and related work


Summary. Ref. [Don+08a] discusses a new RL algorithm that is inspired by the superposition principle of quantum mechanics. The authors propose an algorithm that modifies the action-selection procedure and balances exploration and exploitation in a novel way. The authors present their ideas in modified form in a sequence of papers, see Refs. [Don+08a, Don+06, Don+06a, Che+06, CD08, CDC06, Don+08, CD10, Don+12, CFD12, Fak+13, NGC15, GH19, Li+20, LAD21, Hu+21]; for an overview see also [Raj+21]. The original work [Don+08a] discusses how to execute the proposed algorithm on actual quantum devices – which, however, did not exist at that time. As also discussed below, it is not clear how to run the algorithm in quantum superposition, and whether this is possible in practice without taking away the potential quantum advantage. Despite these doubts, the proposed concepts enhance classical RL with ideas from QC, which leads us to view this approach as QiRL.

Algorithmic Concepts and Extensions. Initially, the algorithm is formulated as merely quantum inspired in Ref. [CD08] (i.e., it is developed for a classical computer that simulates a quantum superposition). The motivation is to design an algorithm with a better exploration-exploitation trade-off compared to, e.g., $\epsilon$-greedy action selection. The underlying routine is a modification of temporal difference (TD) learning, more concretely TD(0), in the following way: For each state the set of possible actions is in a ‘superposition’ and the agent (in state $s$) now selects an action with a given probability. The action is taken and the new state $s'$ and reward $r$ are observed. Afterwards, the probability of the taken action is increased by $k(r+V(s'))$, where $V(s')$ is the value function of state $s'$, and $k$ is a hyperparameter. The term $r+V(s')$ samples a quantity similar to $Q(s,a)$. Consequently, the update creates a probability distribution, where for a given state the probability to select an action increases as the value of $Q(s,a)$ increases. Therefore, this action selection process corresponds to sampling from a stochastic policy dependent on the value of the state-action pairs.
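The classical, quantum-inspired update described above can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, the clipping of negative boosts to zero, the renormalization step, and the discount factor `gamma` (set to 1 to match the undiscounted term $r+V(s')$) are illustrative choices.

```python
import random


def select_action(probs, rng=random):
    """Sample an action index from the per-state action distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def reinforce(probs, action, r, v_next, k):
    """Raise the probability of `action` by k * (r + V(s')) and renormalize."""
    boost = max(0.0, k * (r + v_next))  # assumption: negative boosts are clipped to 0
    updated = list(probs)
    updated[action] += boost
    total = sum(updated)
    return [p / total for p in updated]


def td0_update(values, s, r, s_next, alpha, gamma=1.0):
    """TD(0) value update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    values[s] += alpha * (r + gamma * values[s_next] - values[s])
```

Starting from a uniform distribution over four actions, a single call `reinforce([0.25] * 4, 2, r=1.0, v_next=0.0, k=0.5)` shifts half of the probability mass onto the rewarded action, which illustrates how larger values of $r+V(s')$ bias the stochastic policy.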

Now the algorithm is translated to be run on a quantum computer. The stochastic policy is replaced by a quantum superposition. That is, for each state $s$ the possible actions are represented by the eigenstates of some observable and a superposition of these states is created. If the observable is measured, the state will collapse to an eigenstate associated with an action which will be taken by the agent and therefore constitutes the selection process. After receiving the reward and the new state, the Grover operator is applied $L=\min\{k(r+V(s')),L_{\max}\}$ times to a copy of the superposition state to enhance the amplitude corresponding to the previously selected action. The variable $L_{\max}$ guarantees that the Grover operator is not applied too many times. Note that repeated application of the procedure requires a new copy of the state after each measurement. Due to the no-cloning theorem, this could be realized by many different independent copies of the initial memory, or by a purely classical representation of the states. The latter realization reduces the algorithm to the initial proposal of a quantum-inspired action selection process.
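To make the amplitude-amplification step concrete, the following sketch simulates it classically on a list of real amplitudes. It rests on several assumptions that are not taken from the reviewed papers: the iteration count is truncated to a non-negative integer, the walk starts from the uniform superposition, and the Grover operator is the textbook combination of an oracle sign flip and an inversion about the mean.

```python
import math


def grover_iterations(k, r, v_next, l_max):
    """L = min{k (r + V(s')), L_max}, truncated to a non-negative integer (assumption)."""
    return max(0, min(int(k * (r + v_next)), l_max))


def grover_step(amps, target):
    """One Grover iteration: oracle sign flip on `target`, then inversion about the mean."""
    flipped = [-a if i == target else a for i, a in enumerate(amps)]
    mean = sum(flipped) / len(flipped)
    return [2.0 * mean - a for a in flipped]


def amplify(num_actions, target, iterations):
    """Apply `iterations` Grover steps to the uniform superposition over actions."""
    amps = [1.0 / math.sqrt(num_actions)] * num_actions
    for _ in range(iterations):
        amps = grover_step(amps, target)
    return [a * a for a in amps]  # measurement probabilities per action
```

Because the success probability oscillates with the number of iterations, applying the operator too often lowers the probability of the intended action again, which is one way to read the role of the cap $L_{\max}$.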

In Ref. [Don+12] the QiRL algorithm is applied to robot navigation. It is stated explicitly that QiRL is a classical action-selection method that differs from the ideas of QRL, which in principle could benefit from a quantum computer. In Ref. [GH19] the algorithms are generalized to $Q$-learning as well as double and multiple $Q$-learning. These approaches should also be understood in the context of QiRL. Finally, Refs. [Li+20, Nir+21] apply QiRL to human decision-making behavior, Ref. [Yin+21] to a complex control task, and Ref. [Ren+22] to autonomous vehicles. Recently, the quantum-inspired approach to action selection in RL was transferred to experience replay buffer sampling in $Q$-learning [Wei+21].

Remarks. Although it is mentioned in Refs. [Don+08a, Don+06, CD08, CDC06] that the whole algorithm could be run in quantum superposition on a quantum device, no details of such a genuine QRL algorithm are given. Overall, it is unclear whether such an algorithm exists. Indeed, subsequent work focuses on the QiRL paradigm.

The claims made in Refs. [Don+08a, Don+06, Don+06a, CD08, CDC06, Don+08, Don+12, GH19, Li+20, LAD21, Cho+23] can be summarized as follows: a speed-up in learning through a better balance of exploration and exploitation; less GPU power needed on a classical computer compared to algorithms like classical $Q$-learning; and more robustness against changes of the learning rate. More experiments on larger environments for deeper insights into the scaling of the algorithm and a rigorous complexity analysis would be interesting topics for future work.

4.2 Quantum Reinforcement Learning with Variational Quantum Circuits

This section summarizes the state of the art in VQC-based RL. Several ideas have been proposed in this field, with extensions in different directions. Their common ground is the usage of a VQC as a parameterized function approximator.

The typical hybrid pipeline is summarized in Fig. 6. It was originally proposed for $Q$-function approximation by Chen et al. [Che+20] and extended to policy approximation by Jerbi et al. [Jer+21]. Other work proposes several modifications to this pipeline, which we will describe in the respective summaries. The algorithm must be understood as hybrid, as much of the work, especially the optimization, is executed on classical hardware. The agent observes the current state of the environment $s_t$ and applies some pre-processing $\phi$. The result is encoded using the feature map $U_{\phi(s)}$. With the current variational parameters $\theta_t$, a quantum state is prepared and a (potentially action-dependent) observable $O_a$ is measured. The expectation value $\langle O_a \rangle_{s,\theta}$ can be post-processed to represent, e.g., a state-action value function $Q_\theta(s,a)$ or the policy $\pi_\theta(a|s)$. Depending on the instance, the agent employs this function to sample an action $a_t$ and executes it in the environment. The reward $r_t$ (and potentially also the consecutive state $s_{t+1}$) is observed by the classical optimizer. To enable gradient-based parameter updates, an additional hybrid module uses the parameter-shift rule [Cro19, Wie+22a] to compute the gradients of the VQC outputs w.r.t. the variational parameters $\theta_t$. The classical optimizer determines the new parameter set $\theta_{t+1}$ and instantiates the VQC with these updated parameters. This overall iterative procedure of environment interaction, function approximation, and parameter update is repeated for several episodes, in the same way as for, e.g., DRL.
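The pipeline can be illustrated with a deliberately tiny, classically simulated example. The sketch below assumes a single qubit, the identity pre-processing $\phi(s)=s$, an $R_x$ feature map, one variational $R_y$ rotation, and a Pauli-$Z$ observable; none of these choices is taken from a specific paper. It evaluates the expectation value and obtains its gradient with the parameter-shift rule, which for this gate type requires two circuit evaluations at $\theta \pm \pi/2$.

```python
import math


def rx(angle):
    """Matrix of the single-qubit rotation R_x(angle)."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def ry(angle):
    """Matrix of the single-qubit rotation R_y(angle)."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def apply(gate, state):
    """Multiply a 2x2 gate with a 2-component state vector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


def expectation_z(s, theta):
    """<Z> for the state R_y(theta) R_x(s) |0> (feature map, then variational gate)."""
    state = apply(ry(theta), apply(rx(s), [1.0, 0.0]))
    return abs(state[0]) ** 2 - abs(state[1]) ** 2


def parameter_shift_grad(s, theta):
    """Gradient of <Z> w.r.t. theta from two shifted circuit evaluations."""
    shift = math.pi / 2
    return 0.5 * (expectation_z(s, theta + shift) - expectation_z(s, theta - shift))
```

For this toy circuit the expectation value is $\cos(s)\cos(\theta)$, so the parameter-shift result can be checked against the analytic derivative $-\cos(s)\sin(\theta)$. A classical optimizer would then use this gradient to move from $\theta_t$ to $\theta_{t+1}$.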

Unfortunately, thus far there is no guaranteed quantum advantage for this approach, apart from some cryptography-inspired artificial datasets [Jer+21, SJD22]. However, several of the papers and preprints summarized in this section demonstrate promising experimental results.

Figure 6: Hybrid quantum-classical agent in a typical VQC-based RL pipeline. This idea was first proposed by Chen et al. [Che+20] for $Q$-function approximation and extended by Jerbi et al. [Jer+21] to policy approximation. The QPU is used to approximate the respective function, while pre- and post-processing and optimization happen on classical hardware. The interaction with the environment depends on the concrete problem instance (e.g. classical or quantum environment).

4.2.1 Value-Function Approximation

This section covers VQC-based approximations in value space, as described for the instance of classical $Q$-learning in Eqs. 11 and 12. The work by Chen et al. [Che+20] was indeed the first proposal of this type of approximation-based technique, which was reproduced and extended in Refs. [Lok+22, CCC23, Che23b, FP+23]. A modification of the state encoding procedure has been discussed by Lockwood and Si [LS20], and was up-scaled in another work by the same authors [LS21]. A slight reformulation of the technique – which comes with a provable advantage for very specific scenarios – can be found in Skolik et al. [SJD22]. An analysis of noise influence for this framework is discussed in Ref. [Sko+23]. An extension to environments with continuous action spaces is proposed in Ref. [LXJ23]. Ideas based on amplitude amplification to efficiently evaluate the approximated $Q$-function have been introduced in Ref. [San+23], which however cannot be realized given the current hardware restrictions.

Citation | First Author | Title
[Che+20] | S. Y.-C. Chen | Variational Quantum Circuits for Deep Reinforcement Learning
[Lok+22] | S. Lokes | Implementation of Quantum Deep Reinforcement Learning Using Variational Quantum Circuits
[Che23b] | S. Y.-C. Chen | Quantum deep Q learning with distributed prioritized experience replay
[CCC23] | H.-Y. Chen | Deep-Q Learning with Hybrid Quantum Neural Network on Solving Maze Problems
[FP+23] | G. Fikadu Tilaye | Investigating the Effects of Hyperparameters in Quantum-Enhanced Deep Reinforcement Learning
[LS20] | O. Lockwood | Reinforcement Learning with Quantum Variational Circuits
[LS21] | O. Lockwood | Playing Atari with Hybrid Quantum-Classical Reinforcement Learning
[SJD22] | A. Skolik | Quantum agents in the Gym: a variational quantum algorithm for deep $Q$-learning
[Sko+23] | A. Skolik | Robustness of quantum reinforcement learning under hardware errors
[LXJ23] | Y. Liu | Reinforcement Learning for Continuous Control: A Quantum Normalized Advantage Function Approach

Table 4: Work considered for “QRL with VQCs – Value-Function Approximation” (Sec. 4.2.1)

Variational Quantum Circuits for Deep Reinforcement Learning, Chen et al. (2020) and related work


Summary. This paper by Chen et al. [Che+20] represents the first attempt to utilize VQCs for RL. This is done in the context of using VQCs as function approximators for the state-action value function. The authors perform simulations on simple benchmark environments and report performance on par with NN-based agents at a reduced parameter count.

Hybrid Algorithm. The algorithm is inspired by deep $Q$-learning (DQL) [Mni+15], where a DNN represents the $Q$-function. The authors replace the DNN by a VQC. The update is performed w.r.t. the mean square error (MSE) loss function $\mathcal{L}(\theta)=\mathbb{E}\left[\left(r_t+\gamma\cdot\max_{a'}Q_{\theta'}(s_{t+1},a')-Q_{\theta}(s_t,a_t)\right)^2\right]$ using, e.g., gradient descent. Additionally, experience replay and target networks (second set of parameters $\theta'$) are employed to address the instabilities stemming from bootstrapping the value function, forming a double deep $Q$-learning (DDQL) algorithm. Fig. 7 gives the complete algorithm.
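The loss above can be written down independently of how the $Q$-function is realized. The following sketch treats the online and target approximators as plain callables `q(s, a)` and `q_target(s, a)`, which in the hybrid setting would wrap VQC evaluations with parameters $\theta$ and $\theta'$. The function names, the batch layout, and the terminal flag `done` (which switches off bootstrapping at the end of an episode) are assumptions made for illustration.

```python
def td_target(r, s_next, done, q_target, actions, gamma):
    """r + gamma * max_a' Q_theta'(s', a'); no bootstrapping on terminal transitions."""
    if done:
        return r
    return r + gamma * max(q_target(s_next, a) for a in actions)


def mse_td_loss(batch, q, q_target, actions, gamma=0.99):
    """Mean squared TD error over a batch of (s, a, r, s_next, done) transitions."""
    errors = [td_target(r, s_next, done, q_target, actions, gamma) - q(s, a)
              for (s, a, r, s_next, done) in batch]
    return sum(e * e for e in errors) / len(errors)
```

In a training loop the batch would be sampled from the experience replay buffer, and the target callable would be refreshed only occasionally with a copy of the online parameters.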

Figure 7: Hybrid algorithm proposed by and taken from Chen et al. [Che+20]. This algorithm uses a VQC to approximate the state-action value function and follows the typical steps of DQL. Note that the authors' notation for the $Q$-function slightly deviates from our conventions.

VQC Architecture. The feature map uses simple computational basis encoding on individual qubits. More concretely, the RL state is interpreted as a bitstring, which can be encoded using the identity $R_z(\pi)R_x(\pi)\ket{0}=\ket{1}$. The entanglement structure connects nearest neighbors with $CZ$ gates. The variational parameters are incorporated in single-qubit rotations about the $x$, $y$, and $z$ axes. The state-action value is decoded by measuring Pauli-$Z$ observables on a number of qubits that corresponds to the number of actions in the environment. The full VQC is visualized in Fig. 8.
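The encoding identity can be verified numerically with a few lines of plain Python. The sketch below builds the two rotation matrices, applies them to $\ket{0}$, and additionally converts an integer state label into a bitstring. The most-significant-bit-first convention in `state_to_bits` is an assumption; the reviewed paper may order the bits differently.

```python
import math


def rx(angle):
    """Matrix of the single-qubit rotation R_x(angle)."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def rz(angle):
    """Matrix of the single-qubit rotation R_z(angle) = diag(e^{-i a/2}, e^{+i a/2})."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[complex(c, -s), 0.0], [0.0, complex(c, s)]]


def apply(gate, state):
    """Multiply a 2x2 gate with a 2-component state vector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


def encode_bit(bit):
    """Prepare |bit> from |0> via R_z(bit * pi) R_x(bit * pi)."""
    angle = math.pi * bit
    return apply(rz(angle), apply(rx(angle), [1.0, 0.0]))


def state_to_bits(state_index, num_qubits):
    """Binary expansion of an integer state label, most significant bit first (assumption)."""
    return [(state_index >> i) & 1 for i in reversed(range(num_qubits))]
```

With this convention, the 16 FrozenLake states map onto 4 qubits; for instance, `state_to_bits(5, 4)` gives `[0, 1, 0, 1]`, so the second and fourth qubit are flipped.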

Figure 8: VQC proposed by and taken from Chen et al. [Che+20]. The $R_x$ and $R_z$ gates are used for state encoding. Several parameterized layers (dashed box) are repeated to form the $Q$-function approximator. The values of the function are decoded using $1$-qubit Pauli-$Z$ observables.

Experimental Results and Discussion. The proposed VQC-DQL algorithm is simulated for two environments. The first one is FrozenLake, with $16$ states and $4$ actions. The second one is CognitiveRadio, which is adapted to VQC sizes of $2$ to $5$ qubits. The authors report that their VQC-based agent performs at least as well as an NN. Moreover, they claim that this requires fewer parameters (about one order of magnitude compared to DNNs), which points towards potential quantum advantage. The model is tested on actual quantum hardware with competitive results.

Remarks. The employed encoding scheme (computational basis encoding) could be simplified by omitting the $R_z$ rotations, as these only introduce a global phase. The CognitiveRadio environment might be oversimplified. We also note that the claim on reduced parameter count should be substantiated by experiments with environments of different scale.

Reproduction. A reproduction study by Lokes et al. [Lok+22] conducts an extended hyperparameter search for the described setup. The results and claims are overall consistent with [Che+20], but no novel findings were reported.

Extension. In the work by S. Y.-C. Chen [Che23b] the quantum $Q$-learning framework introduced in [Che+20] is extended by incorporating prioritized experience replay. Additionally, an asynchronous training routine is employed, similar to the one discussed in [Che23]. Both techniques reduce the overall sampling complexity and therefore allow for solving more complex tasks with the same underlying quantum model. This is validated with numerical simulations on several versions of the CartPole environment.

Hybrid Model. The work by Chen et al. [CCC23] extends the quantum models used in [Che+20] with classical neural networks, to produce more expressive function approximators. With that extension, the quantum agent is able to solve a $20\times 20$ gridworld maze, which should clearly be more complex than the originally considered FrozenLake environment. However, with the provided analysis it is unclear to what extent the performance can be attributed to the quantum part of the model.

Hyperparameter Analysis. A hyperparameter analysis is conducted by Fikadu Tilaye and Pandey [FP+23], with a focus on the $Q$-learning framework introduced in [Che+20]. The authors conclude that deeper quantum circuits lead to a better overall performance, while a larger learning rate speeds up the overall process. However, the analysis is superficial and quite small-scale, so further investigations are necessary to allow for more general statements.

Table 5: Algorithmic Characteristics - Chen et al. [Che+20]

Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates (1)
FrozenLake (OpenAI Gym) | DDQL | $Q$-function | discrete, $16$ | discrete, $4$ | $4$ | $4\times 2$ (encoding), $4\times 4\times 3$ (weights)
CognitiveRadio (see [Che+20]) | DDQL | $Q$-function | discrete, $n^2$ | discrete, $n$ | $n$ | $n\times 2$ (encoding), $n\times 4\times 3$ (weights)

(1) Encoding gates: qubits $\times$ per_qubit; variational gates: qubits $\times$ layers $\times$ per_qubit_per_layer.

Reinforcement Learning with Quantum Variational Circuits, Lockwood and Si (2020)


Summary. The work by Lockwood and Si [LS20] modifies several aspects of the routine proposed by Chen et al. [Che+20]. Most importantly, they introduce two new encoding schemes to deal with a continuous state space.

Modification of Architecture. The first proposed encoding is denoted as scaled encoding. It scales the RL state values to the range $[0,2\pi)$, which are then encoded using some $1$-qubit parameterized rotations. The second one (so-called directional encoding) only encodes the sign of the value. More concretely, if a state variable is positive, $R_x$ and $R_z$ rotations by $\pi$ are applied to the encoding qubit (following a similar idea as the computational basis encoding [Che+20]).
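Both encodings reduce to choosing a rotation angle per state variable, which the following sketch makes explicit. The linear min-max scaling with user-supplied bounds `low` and `high`, the clipping of out-of-range values, and the function names are assumptions; the source only states that values are scaled to $[0,2\pi)$ and that positive values trigger rotations by $\pi$.

```python
import math


def scaled_angle(value, low, high):
    """Linearly map a value from [low, high] to a rotation angle in [0, 2*pi)."""
    fraction = (value - low) / (high - low)
    # Clip to keep the angle inside the half-open interval (assumption).
    fraction = min(max(fraction, 0.0), 1.0 - 1e-9)
    return 2.0 * math.pi * fraction


def directional_angle(value):
    """Rotation angle pi for a positive state variable, 0 otherwise."""
    return math.pi if value > 0 else 0.0
```

The directional variant maps every positive value to the same angle, so two states that differ only in magnitude become indistinguishable to the circuit.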

The architecture for the variational layer consists of an entangling block (nearest-neighbor $CX$ gates) and parameterized $1$-qubit rotations about the $x$, $y$, and $z$ axes. This block is repeated three times. For decoding the state-action value, the authors employ two different strategies. The first one feeds the measurement result into a classical fully-connected layer where the number of outputs corresponds to the number of possible actions. In the other case, a so-called quantum pooling operation condenses the information of the quantum state into a subset of the qubits [CCL19]. This allows for a more flexible architecture, independent of the number of actions in the environment.

Experimental Results. The proposed algorithm and the encoding schemes are benchmarked on the CartPole and Blackjack environments. While the former uses a combination of scaled and directional encoding, the latter only employs scaled encoding. Their findings agree with those reported previously in the literature, namely that VQC-based models achieve similar performance to NN-based function approximators. As also stated by Chen et al. [Che+20], the usage of VQCs reduces the required parameter complexity.

Remarks. While the scaled encoding should be a sound choice, the directional encoding could be inappropriate for most environments. Usually, not only the sign of a state variable is relevant, but its concrete value also carries information. With this encoding, this information is lost, which should lead to a drop in performance for more complex environments. As stated previously, the reduced parameter complexity should be investigated for larger problem instances.

Table 6: Algorithmic Characteristics - Lockwood and Si [LS20]

Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates (1)
CartPole (OpenAI Gym) | DDQL | $Q$-function | continuous, $4$-dim | discrete, $2$ | $4$ | $4\times 2$ (encoding), $4\times 3\times 3$ (weights)
Blackjack (OpenAI Gym) | DDQL | $Q$-function | discrete, $31\times 11\times 2$ | discrete, $2$ | $3$ | $3\times 2$ (encoding), $3\times 3\times 3$ (weights)

(1) Encoding gates: qubits $\times$ per_qubit; variational gates: qubits $\times$ layers $\times$ per_qubit_per_layer.

Playing Atari with Hybrid Quantum-Classical Reinforcement Learning, Lockwood and Si (2021)


Summary. This work by Lockwood and Si [LS21] extends their previous paper [LS20], which, in turn, was based on Chen et al. [Che+20], where $Q$-learning with VQC function approximation has been introduced. The paper considers the Atari environments Pong and Breakout, with continuous state space of dimensionality $28{,}224$ (the observations are cropped and converted to images with $84\times 84\times 4$ pixels). This environment complexity is not tractable with previously introduced encoding schemes, which require one qubit for each dimension. The proposed workaround uses a classical NN to reduce the state dimensionality before encoding it into the VQC.

Underlying Algorithm and Simulation. Similar to Refs. [Che+20, LS20], the concept of DDQL is used. The pipeline is modified by replacing the pure VQC function approximator with a hybrid model. Several different choices are considered; the most important details are highlighted below. The training is performed in an end-to-end manner, i.e., the gradients w.r.t. the VQC parameters are propagated back through the classical encoding network.

Model Architecture. As usual, the VQC architecture is composed of three parts (i.e., state encoding, variational layers, and action decoding). To encode the state, the raw data is first fed through a classical NN. This outputs a number of values equal to the number of parameters in the feature map, which itself consists of $1$-qubit parameterized rotations. The authors compare the performance of a densely connected and a convolutional neural network (CNN) for this task (the concrete architectures of these networks are not specified). Apart from that, encoding layers of different sizes (and therefore different numbers of parameters) from $5$ to $15$ qubits are compared.
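The dimensionality reduction performed by the classical encoding network can be illustrated with a minimal sketch. Since the concrete architectures are not specified, everything below is an assumption: a single dense layer, a tanh squashing to the interval $(-\pi, \pi)$, two rotations per qubit, and the names `encode`, `W`, and `b`.

```python
import numpy as np

rng = np.random.default_rng(0)

n_qubits = 5             # the paper compares 5, 10, and 15 qubits
rotations_per_qubit = 2  # assumption: parameterized rotations per qubit
n_angles = n_qubits * rotations_per_qubit

# Hypothetical single dense layer mapping the flattened frames to angles.
W = rng.normal(scale=1e-2, size=(84 * 84 * 4, n_angles))
b = np.zeros(n_angles)


def encode(frames):
    """Map an (84, 84, 4) observation to feature-map rotation angles."""
    x = frames.reshape(-1) / 255.0      # flatten to 28,224 values in [0, 1]
    return np.pi * np.tanh(x @ W + b)   # squash into (-pi, pi)


angles = encode(rng.integers(0, 256, size=(84, 84, 4)))
print(angles.shape)
```

In an end-to-end setup, `W` and `b` would be trained jointly with the variational parameters of the circuit.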

The variational part itself consists of two components, the first of which is a quantum convolutional neural network (QCNN) [CCL19]. The authors state two motivations for this choice: First, it should help capture the spatial structure of the input images (but it is unclear whether the encoding part retains the spatial structure). Second, QCNNs help to avoid barren plateaus [Pes+21] (while the experiments show no sign of barren plateaus, it is not clear whether this is due to this choice or to the limited size of the employed circuits). The QCNN is followed by three repetitions of entanglement gates and parameterized rotations, similar to those used for state encoding.

The paper proposes two methods to deal with the mismatch between the number of qubits and the number of actions at measurement. The first method performs Pauli-$Z$ measurements on all qubits and uses an appended dense NN. Alternatively, quantum pooling operations [CCL19] are used, which successively compress two qubits into one.

Experimental Results and Discussion. To demonstrate the basic functionality of the model, initial experiments are conducted on the CartPole environment. The results demonstrate a performance similar to that of Lockwood and Si [LS20]. On the two Atari environments, the paper considers $12$ different hybrid architectures (dense vs. convolutional encoding, $5$ vs. $10$ vs. $15$ qubits, dense vs. pooling decoding), which are compared to a well-established classical architecture.

It turns out that the hybrid models are not able to learn at all. The authors attribute this to the lack of expressibility of the hybrid models, which only make use of about $10^{4}$ parameters, while the classical model uses about $10^{6}$. Performance is expected to improve for more expressive models, as learning on the much simpler CartPole environment was successful.

Remarks. The experiments are conducted with a restricted set of hybrid models. Consequently, the claim that these results do not demonstrate the inapplicability of QRL to more complex environments like Atari is reasonable. The assumption that this approach could be made to work on complex environments, because it succeeds on, e.g., CartPole, should be substantiated with additional experiments. For a modified architecture succeeding on the Atari environments, it would not be completely clear which part of the work is done by the classical and which by the quantum part of the model. This is a typical caveat whenever quantum and classical architectures are combined.

Table 7: *

Algorithmic Characteristics - Lockwood and Si [LS21]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates |
| --- | --- | --- | --- | --- | --- | --- |
| CartPole (OpenAI Gym) | DDQL | $Q$-function | continuous, $4$-dim | discrete, $2$ | $5$ | N/A (classical)¹, $\mathcal{O}(10^{1})$ (encoding), $\mathcal{O}(10^{2})$ (weights) |
| Pong-v0 (OpenAI Gym) | DDQL | $Q$-function | continuous, $28224$-dim² | discrete, $6$ | $5$ to $15$ | $\mathcal{O}(10^{6})$ (classical), $\mathcal{O}(10^{2})$ (encoding), $\mathcal{O}(10^{4})$ (weights) |
| Breakout-v0 (OpenAI Gym) | DDQL | $Q$-function | continuous, $28224$-dim² | discrete, $4$ | $5$ to $15$ | $\mathcal{O}(10^{6})$ (classical), $\mathcal{O}(10^{2})$ (encoding), $\mathcal{O}(10^{4})$ (weights) |

¹ Potentially also uses a classical NN for pre-processing; details are not stated.
² The dimensionality of the feature space is reduced with a NN to fit the size of the feature map.

Quantum agents in the Gym: a variational quantum algorithm for deep $Q$-learning, Skolik et al. (2022)


Summary. This work by Skolik et al. [SJD22] proposes another instance of $Q$-learning with VQCs as function approximators. Being aware of the preceding literature, the authors set out to analyze the role of architecture design, RL state encoding schemes, and observables for action decoding. With regard to previous work, the authors remark that the CartPole environment cannot be considered solved.

Importance of Architecture Design. In terms of architecture choices, the problem of barren plateaus is emphasized: Architectures with many qubits and layers (which naively are required for high expressivity) are hard to train. Contrarily, over-parameterized architectures are easier to train, but probably less expressive and therefore less effective on a given task.

The authors choose a hardware-efficient ansatz, despite it being known to run into the barren plateau problem for large circuits. For the small circuit sizes considered in the present work, the barren plateau problem does not appear to be relevant.

Encoding Schemes. As for encoding schemes, discrete RL states are encoded in the computational basis. Continuous states are scaled to the finite interval $[-\pi/2, +\pi/2]$ by applying $\arctan$ to the raw observations. The result serves as the rotation angle for an $R_{x}$ rotation, which is very similar to the scaled encoding proposed by Lockwood and Si [LS20]. In order to increase expressivity w.r.t. the input, the encoding layer can be repeated throughout the circuit, forming a data re-uploading structure [Pér+20]. Effectively, this allows the model to learn and approximate a Fourier sum of a certain order, where the order is tied to the number of repetitions of the encoding layer [SSM21]. The encoding is further modified by introducing learnable re-scaling parameters that are multiplied with the raw states before computing the $\arctan$.
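The scaled $\arctan$ encoding described above can be made concrete with a short sketch. The observation values and the function names are hypothetical; the identity used in `z_expectation_after_rx` is the textbook result that a qubit prepared as $R_{x}(\phi)|0\rangle$ has $\langle Z \rangle = \cos\phi$.

```python
import math


def encoding_angles(observation, scaling):
    """Rotation angles arctan(lambda_i * s_i), each within (-pi/2, pi/2)."""
    return [math.atan(lam * s) for lam, s in zip(scaling, observation)]


def z_expectation_after_rx(angle):
    """<Z> of a single qubit prepared as R_x(angle)|0>, which equals cos(angle)."""
    return math.cos(angle)


# Hypothetical CartPole-like observation with unbounded velocity components.
obs = [0.03, -1.8, 0.02, 25.0]
lam = [1.0, 1.0, 1.0, 1.0]   # learnable scaling parameters, here fixed to 1
angles = encoding_angles(obs, lam)
print([round(a, 3) for a in angles])
```

Large observation values saturate near $\pm\pi/2$, which is one reason why a learnable scaling of the raw states can help the model resolve the relevant input range.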

Experimental Results and Discussion. The authors benchmark their architecture choices on the FrozenLake and CartPole environments. The performance on CartPole is compared to a small NN with the same number of parameters, which seems to be inferior. Further, the range of $Q$-values that can be encountered in the two benchmark environments is investigated. For FrozenLake, representing the $Q$-values with the expectation values of $1$-qubit $Z$-operators is sufficient. For the CartPole environment, this strategy is found not to be flexible enough. Instead, the authors choose the expectation values of the parities (of $2$ non-overlapping pairs of qubits) and allow for additional trainable classical weights that set the scale for the $Q$-value approximation.
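The parity-based decoding for CartPole can be sketched as follows. This is an illustration under assumptions: the qubit ordering (qubit 0 as the most significant bit), the pairing $(0,1)$ and $(2,3)$, the example state, and the weight values are hypothetical choices.

```python
import numpy as np


def parity_expectation(state, qubits, n_qubits):
    """<Z_i Z_j> for a statevector; qubit 0 is the most significant bit."""
    probs = np.abs(state) ** 2
    value = 0.0
    for index, p in enumerate(probs):
        bits = [(index >> (n_qubits - 1 - q)) & 1 for q in qubits]
        value += p * (-1) ** sum(bits)   # +1 for even parity, -1 for odd
    return value


n = 4
state = np.zeros(2 ** n, dtype=complex)
state[0b0100] = 1.0                  # basis state |0100>
weights = np.array([10.0, 10.0])     # hypothetical trainable output scales

q_values = weights * np.array([
    parity_expectation(state, (0, 1), n),
    parity_expectation(state, (2, 3), n),
])
print(q_values)
```

Since each parity expectation lies in $[-1, 1]$, the trainable weights are what allow the output to cover the larger range of $Q$-values.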

Remarks. The authors emphasize the critical role of architectural choices at the outset of their manuscript. While they offer valuable insights into this topic, open questions remain for future work in this direction. For the CartPole environment, several trainable classical weights are incorporated in the algorithm. Therefore, it is not completely clear which part of the training is achieved by which part of the hybrid model.

Error Analysis. The work by Skolik et al. [Sko+23] analyzes the influence of hardware noise on the quantum $Q$-learning framework introduced in [SJD22], as well as on the quantum policy gradient (QPG) approaches discussed in Sec. 4.2.2. The results are numerically validated on the CartPole environment and a version of the Travelling Salesperson Problem. They indicate that the performance depends strongly on the inherent structure of the noise. In some instances, the robustness of the learned policy actually increases if noise is encountered during training. However, for strong incoherent noise, for example, the performance decreases quite substantially. Of particular practical interest is the analysis of shot noise, which indicates that a low number of repetitions is enough to get a reliable estimate of the $Q$-function; an explicit algorithm to exploit this property is proposed in this work.
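To illustrate what shot noise means in this context, the following sketch estimates an expectation value from a finite number of $\pm 1$ measurement outcomes. It does not reproduce the algorithm proposed in [Sko+23]; the function name, the exact expectation value, and the shot counts are assumptions.

```python
import random


def estimate_z(expectation, shots, rng):
    """Estimate <Z> from a finite number of +/-1 measurement outcomes."""
    p_plus = (1.0 + expectation) / 2.0   # probability of outcome +1
    outcomes = [1 if rng.random() < p_plus else -1 for _ in range(shots)]
    return sum(outcomes) / shots


rng = random.Random(0)
true_value = 0.3   # hypothetical exact expectation value
for shots in (10, 100, 10_000):
    print(shots, estimate_z(true_value, shots, rng))
```

The statistical error of such an estimate shrinks proportionally to the inverse square root of the number of shots, which is why the number of repetitions is a relevant cost factor on hardware.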

Continuous Action Spaces. A $Q$-learning approach based on [SJD22] that incorporates continuous action spaces is discussed by Liu et al. [LXJ23]. They use normalized advantage functions, which allow for continuous action selection. An alternative would be to additionally use a policy function approximator to form an actor-critic approach, as discussed in Sec. 4.2.3.
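In their standard classical form, normalized advantage functions restrict the advantage to a quadratic function of the action, so that the greedy action is available in closed form. The sketch below illustrates this standard form; whether [LXJ23] uses exactly this parameterization is an assumption, and all names and values are hypothetical.

```python
import numpy as np


def naf_q_value(action, value, mu, L):
    """Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu) with P = L L^T."""
    diff = np.asarray(action, dtype=float) - np.asarray(mu, dtype=float)
    P = L @ L.T   # positive semi-definite by construction
    return value - 0.5 * diff @ P @ diff


# Hypothetical model outputs for one state with a 2-dimensional action.
L = np.array([[1.0, 0.0], [0.5, 2.0]])   # lower-triangular factor
mu = np.array([0.1, -0.2])               # greedy action
print(naf_q_value(mu, 3.0, mu, L))        # value at the greedy action
print(naf_q_value(mu + 0.3, 3.0, mu, L))  # any other action scores lower
```

Because the quadratic term is never positive, the maximum over actions is attained at `mu`, which removes the need for an explicit maximization over a continuous action space.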

Table 8: *

Algorithmic Characteristics - Skolik et al. [SJD22]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| CartPole (OpenAI Gym) | DDQL | $Q$-function | continuous, $4$-dim | discrete, $2$ | $4$ | $4 \times 1$ (encoding), $4 \times 15 \times 2$ (weights), N/A (classical)² |
| FrozenLake (OpenAI Gym) | DDQL | $Q$-function | discrete, $16$ | discrete, $4$ | $4$ | $4 \times 1$ (encoding), $4 \times 15 \times 2$ (weights) |

¹ Encoding gates: $qubits \times per\_qubit$; variational gates: $qubits \times layers \times per\_qubit\_per\_layer$.
² The model incorporates classical weights after measurement; details are not stated.

4.2.2 Policy Approximation

This section covers VQC-based approximations in policy space, as described for the instance of classical policy gradients in Eqs. 13 and 14. The concept was introduced by Jerbi et al. [Jer+21], shortly followed by a slight reformulation in Ref. [Kun22], and an extension to allow for faster computation in Ref. [BAQ23]. Several modifications, including a formulation of fully quantum interaction with a quantum control environment, have been introduced in Sequeira et al. [SSB23], with a closer analysis of quantum-accessible environments revealing potential advantage compared to certain classical routines in Ref. [Jer+23]. Algorithmic extensions to the QPG setup were proposed in Ref. [Mey21]. Details on a classical post-processing function introduced therein to improve RL performance are discussed in Meyer et al. [Mey+23a], and quantum natural gradients to enhance trainability are covered by the same authors in [Mey+23].

| Citation | First Author | Title |
| --- | --- | --- |
| [Jer+21] | S. Jerbi | Parameterized Quantum Policies for Reinforcement Learning |
| [Kun22] | L. Kunczik | Reinforcement Learning with Hybrid Quantum Approximation in the NISQ Context |
| [BAQ23] | Quafu Group | Quafu-RL: The Cloud Quantum Computers based Quantum Reinforcement Learning |
| [SSB23] | A. Sequeira | Policy gradients using variational quantum circuits |
| [Jer+23] | S. Jerbi | Quantum Policy Gradient Algorithms |
| [Mey+23a] | N. Meyer | Quantum Policy Gradient Algorithm with Optimized Action Decoding |
| [Mey+23] | N. Meyer | Quantum Natural Policy Gradients: Towards Sample-Efficient Reinforcement Learning |

Table 9: Work considered for “QRL with VQCs – Policy Approximation” (Sec. 4.2.2)

Parameterized Quantum Policies for Reinforcement Learning, Jerbi et al. (2021) and related work


Summary. The paper by Jerbi et al. [Jer+21] starts out with a short summary of VQC-based ML models. The authors cite several reports of quantum advantage in supervised and unsupervised QML. This motivates their approach to go beyond the scope of $Q$-function approximation [Che+20, LS20, LS21, SJD22] and use the VQC to directly approximate the policy.

Quantum Policy Gradient. After a brief recap of policy gradient methods for solving RL problems, the authors extend those ideas to a QPG approach. More concretely, they quantize the REINFORCE algorithm [Wil92] with value-function baselines by using VQCs as function approximators for the (stochastic) policy. They define two families of VQC-based policies: (1) A RAW-VQC policy, where the action selection follows Born’s rule. It is defined as $\pi_{\theta}(a|s) = \langle P_{a} \rangle_{s,\theta}$, where $P_{a}$ are the projectors on the elements of the computational basis. This allows action selection with only one evaluation of the quantum circuit; (2) A SOFTMAX-VQC policy, defined as $\pi_{\theta}(a|s) = e^{\beta \langle O_{a} \rangle_{s,\theta}} / \sum_{a'} e^{\beta \langle O_{a'} \rangle_{s,\theta}}$.
The expectation value of an action-dependent observable $O_{a}$ is fed into a single-parameter softmax function to form a probability distribution over actions. The inverse-temperature parameter $\beta$ allows adjusting the peak width of the distribution, i.e., the greediness of the policy.
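The two policy families can be illustrated with a small classical sketch that takes the circuit outputs as given. The function names, amplitudes, and expectation values are hypothetical; in practice they would come from evaluating the VQC.

```python
import math


def raw_policy(amplitudes):
    """RAW-VQC: Born-rule probabilities of the computational basis states."""
    return [abs(a) ** 2 for a in amplitudes]


def softmax_policy(expectations, beta):
    """SOFTMAX-VQC: softmax over beta-scaled expectation values <O_a>."""
    top = max(expectations)   # subtracting the maximum leaves the result unchanged
    weights = [math.exp(beta * (e - top)) for e in expectations]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical expectation values for two actions.
expectations = [0.2, -0.4]
for beta in (1.0, 10.0):
    print(beta, softmax_policy(expectations, beta))
```

A larger $\beta$ concentrates the probability mass on the action with the largest expectation value, which is the sense in which $\beta$ controls the greediness of the policy.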

Circuit Architecture. The ansatz for the VQC is chosen to be hardware-efficient, i.e., it uses only single- and two-qubit gates. The RL state is encoded with $1$-qubit rotations. To increase the expressivity of the model, the authors introduce additional learnable state-scaling parameters $\lambda$. Those are multiplied with the state values before they enter the rotations, i.e., $\lambda_{i} \cdot s_{i}$ is the angle of a $1$-qubit rotation. This also helps circumvent the problem of being restricted to a finite set of frequencies in such an encoding scheme [SSM21]. The feature map is repeated several times, alternating with the variational layer, which forms a data re-uploading structure [Pér+20]. A variational layer consists of $CZ$-gates for creating entanglement in a circular structure and $1$-qubit parameterized rotation gates, which carry the learnable parameters. Depending on the policy type, measurements are either conducted in the computational basis, or more complex observables are measured.
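A data re-uploading circuit of this kind can be simulated with a few lines of NumPy. The sketch below is not the exact circuit of [Jer+21]: the choice of $R_{y}$ for the variational rotations, $R_{x}$ for the encoding, the ordering of the blocks, and all names are assumptions made for illustration.

```python
import numpy as np


def apply_1q(state, gate, qubit, n):
    """Apply a 2x2 gate to one qubit of an n-qubit statevector."""
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [qubit]))
    psi = np.moveaxis(psi, 0, qubit)
    return psi.reshape(-1)


def apply_cz(state, q1, q2, n):
    """Apply a controlled-Z gate: flip the sign where both qubits are 1."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q1], idx[q2] = 1, 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)


def rx(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def reuploading_state(s, lam, theta):
    """lam has shape (layers, n); theta has shape (layers + 1, n)."""
    n = len(s)
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0
    for layer in range(lam.shape[0]):
        for q in range(n):   # variational rotations
            state = apply_1q(state, ry(theta[layer, q]), q, n)
        for q in range(n):   # circular CZ entanglement
            state = apply_cz(state, q, (q + 1) % n, n)
        for q in range(n):   # re-upload the scaled state lambda_i * s_i
            state = apply_1q(state, rx(lam[layer, q] * s[q]), q, n)
    for q in range(n):       # final variational rotations
        state = apply_1q(state, ry(theta[-1, q]), q, n)
    return state


rng = np.random.default_rng(0)
s = np.array([0.1, -0.5, 0.3, 0.8])   # hypothetical 4-dimensional RL state
layers = 2
lam = np.ones((layers, 4))
theta = rng.uniform(-np.pi, np.pi, size=(layers + 1, 4))
state = reuploading_state(s, lam, theta)
print(np.sum(np.abs(state) ** 2))
```

The resulting statevector can be passed on to either a computational-basis readout or the evaluation of observables, depending on the policy type.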

Experimental Results. Overall, all agents are able to learn meaningful behavior in the OpenAI Gym environments CartPole, MountainCar, and Acrobot. Further experiments are reported, which serve the purpose of assessing the importance of the various design choices: (1) Increasing the circuit depth improves performance and learning speed, where SOFTMAX-VQC policies outperform RAW-VQC policies in all instances; (2) Incorporating learnable state-scaling parameters increases learning performance, and trainable classical weights (in the case of SOFTMAX-VQC) multiplied with the expectation values lead to a further increase in performance; (3) The performance gap between RAW-VQC and SOFTMAX-VQC policies seems to stem from the ability to adjust greediness.

Provable and Empirical Quantum Advantage. To the best of our knowledge, this work is the first to corroborate the idea of quantum advantage with VQCs in the RL setting. To this end, the authors devise RL environments (based on the discrete logarithm problem (DLP)), which are supposed to be classically intractable. Any classical algorithm would need a number of samples that scales exponentially in the problem size to achieve a low generalization error. A VQC-based algorithm with a very specific architecture only requires a polynomial amount of data. This implies an exponential advantage w.r.t. sample complexity, assuming it is infeasible to efficiently simulate the VQC on classical hardware for large problem instances. The construction of the environment is inspired by previous results from QML, where similar learning separations between classical and quantum models have been demonstrated [LAT21].

Further, the authors report numerical evidence of potential quantum advantage for environments based on expectation values sampled from VQCs. The motivation lies in the (potential) intractability of simulating the given VQC classically for large systems. More concretely, one uses a VQC to define a labeling function (in the sense of a classification task) over the domain $[0, 2\pi]^{2}$ (so-called SL-VQC). This synthetic classification dataset is then rephrased as an RL environment by incorporating some temporal structure (denoted as Cliffwalk-VQC). Numerically, the authors observe a performance separation between models with classical DNNs and VQC-based policies. They claim that this is likely due to the oscillatory structure in the labeling function.

Remarks. While the proposal of provable quantum advantage is obviously quite encouraging, the practical realization is probably out of reach for the NISQ era. The idea of solving the task efficiently on quantum hardware is based on Shor’s algorithm. Formulated as a VQC-based RL problem, this would require circuits of a complexity far beyond the scope of current hardware. We think that more large-scale experiments are also required to support the empirical learning separation on the SL-VQC and Cliffwalk-VQC environments. A comparison to other hybrid models [Che+20, LS20] shows that the proposed QPG approach is superior in terms of RL performance on various environments.

Alternative Formulation. In the PhD thesis by L. Kunczik [Kun22], a slightly different formulation of the QPG framework is introduced, where the output of the quantum circuit is combined with a classical weight vector. However, the underlying routine is very similar to [Jer+21]. Empirical results are reported to verify a desirable scaling of VQC-based (as opposed to NN-based) approaches. However, the experiments are too small-scale for reliable statements regarding this correlation.

Cloud Computing. The work by the BAQIS Quafu Group [BAQ23] realizes the framework introduced in Sec. 4.2.2 and executes it on the quantum devices provided via the Quafu cloud services. The results are inconclusive, as the agents trained on hardware are not really able to learn meaningful behaviour, but they are also only trained for a very limited number of timesteps, as acknowledged by the authors.

Table 10: *

Algorithmic Characteristics - Jerbi et al. [Jer+21]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| CartPole (OpenAI Gym) | REINFORCE | Policy | continuous, $4$-dim | discrete, $2$ | $4$ | $30$ |
| MountainCar (OpenAI Gym) | REINFORCE | Policy | continuous, $2$-dim | discrete, $3$ | $2$ | $36$ |
| Acrobot (OpenAI Gym) | REINFORCE | Policy | continuous, $6$-dim | discrete, $3$ | $6$ | $72$ |
| SL-VQC, Cliffwalk-VQC (see [Jer+21]) | REINFORCE | Policy | continuous, $2$-dim | discrete, $2$ | $2$ | $37$ |
| CognitiveRadio (see [Che+20]) | REINFORCE | Policy | discrete, $n^{2}$ | discrete, $n$ | $n$, for $n=2$ to $5$ | $30$ to $75$ |

¹ This entails encoding, scaling, and variational parameters; the SOFTMAX-VQC also uses classical parameters.

Policy gradients using variational quantum circuits, Sequeira et al. (2023) and related work


Summary. The article by Sequeira et al. [SSB23] proposes a quantum version of the REINFORCE algorithm with a VQC-based function approximator, very similar to Jerbi et al. [Jer+21]. The methods are applied to the classical environments CartPole and Acrobot but also to a simple quantum control problem. It proposes an initialization technique for the variational parameters of a VQC. Following the experimental results, a quantum advantage w.r.t. the number of required parameters and trainability of the models is claimed.

Underlying Reinforcement Learning Algorithm. As in Jerbi et al. [Jer+21], the policy is defined as $\pi_{\theta}(a|s) = e^{\beta \cdot \langle O_{a} \rangle_{\theta}} / \sum_{a'} e^{\beta \cdot \langle O_{a'} \rangle_{\theta}}$, and REINFORCE updates are performed. Here, the expectation value $\langle O_{a} \rangle_{\theta}$ for action $a$ is defined as $\langle \sigma_{z}^{a} \rangle$, i.e., the expectation value of the $1$-qubit Pauli-$Z$ observable measured on the $a$-th qubit.

VQC Architecture. The architecture follows the typical three-part structure. In the beginning, the states are encoded with $R_{x}$ rotations, with the state values normalized to the range $[-\pi, \pi)$. Consequently, the number of qubits has to correspond to $\max\{|\mathcal{A}|, |\mathcal{S}|\}$. There are several parameterized layers (see Fig. 9) which incorporate variational parameters in $1$-qubit $R_{y}$ and $R_{z}$ rotations. The entanglement structure can be described as $CX[i, (i+l) \bmod n]$, where $n$ is the number of qubits, and $l$ the index of the layer. The measurement of $1$-qubit Pauli-$Z$ observables is a deviation from the procedure proposed by Jerbi et al. [Jer+21], where multi-qubit observables were used.

Figure 9: VQC architecture proposed by and taken from Sequeira et al. [SSB23]. It deviates from the typical circular entanglement structure.
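The layer-dependent entanglement pattern $CX[i, (i+l) \bmod n]$ can be enumerated with a few lines of Python. The function name is hypothetical, and it is an assumption that layers are indexed from $1$ and that degenerate pairs (control equal to target) are simply skipped.

```python
def cx_pairs(n_qubits, layer):
    """Control/target pairs CX[i, (i + layer) mod n] for one layer."""
    pairs = []
    for i in range(n_qubits):
        target = (i + layer) % n_qubits
        if target != i:   # skip degenerate pairs (layer is a multiple of n)
            pairs.append((i, target))
    return pairs


for layer in (1, 2, 3):   # assumption: layers are indexed from 1
    print(layer, cx_pairs(4, layer))
```

For four qubits, the first layer reproduces the usual nearest-neighbor ring, whereas later layers connect qubits at increasing distance.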

Complexity of Gradient Estimation. The paper gives an estimate of the number of samples required to get an $\epsilon$-approximation of the log-policy gradient. According to this consideration, for a success probability of $1-\delta$, the number of required measurements is bounded by $c \cdot \frac{(1-\epsilon)^{2}}{\epsilon^{2}} \cdot \log\left(\frac{k}{\delta}\right)$. Here, $c$ is a constant depending on algorithmic hyperparameters and $k$ is the number of variational parameters. It is important to state that this refers to the number of samples / data points required to get a good approximation of the true policy gradient, but not to the explicit estimation of the gradients themselves via, e.g., the parameter-shift rule.
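The scaling behavior of this bound can be checked numerically. In the sketch below, the constant $c$ is set to $1$ and the natural logarithm is used; both are assumptions, as are the chosen values of $\epsilon$ and $\delta$.

```python
import math


def required_samples(epsilon, delta, k, c=1.0):
    """Evaluate c * (1 - eps)^2 / eps^2 * log(k / delta)."""
    return c * (1.0 - epsilon) ** 2 / epsilon ** 2 * math.log(k / delta)


for k in (10, 100, 1000):   # number of variational parameters
    print(k, math.ceil(required_samples(epsilon=0.1, delta=0.05, k=k)))
```

Each tenfold increase in $k$ adds the same constant amount to the bound, which reflects the logarithmic dependence on the number of variational parameters, while the dependence on $\epsilon$ is quadratic.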

Initialization Technique. There is some work proposing a technique for parameter initialization to avoid barren plateaus [Gra+19]. However, a technique to boost the overall performance has not yet been proposed. Inspired by classical ML, the authors aim to break symmetries between different neurons (as initialization with constant values is usually a bad choice). A typical strategy is to select values uniformly at random from $[-\pi, \pi]$, or to draw them from a Gaussian distribution.

Inspired by the classical Glorot initialization scheme [GB10], the paper proposes to use a normal distribution $\mathcal{N}(0, \mathrm{std}^{2})$ with $\mathrm{std} = g \cdot \sqrt{2 / (\mathrm{fan}_{in} + \mathrm{fan}_{out})}$. Here, $g$ is a constant multiplicative factor, $\mathrm{fan}_{in}$ is the number of embedded features, and $\mathrm{fan}_{out}$ is the number of computational basis measurements. This technique demonstrates some promising experimental results, but no theoretical justification is given.
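The initialization rule above translates directly into code. The following sketch is an illustration, not the authors' implementation; the function name, the seed handling, and the example values for the fan-in, fan-out, and number of parameters are hypothetical.

```python
import math
import random


def quantum_glorot_init(n_params, fan_in, fan_out, gain=1.0, seed=0):
    """Draw parameters from N(0, std^2), std = gain * sqrt(2 / (fan_in + fan_out))."""
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n_params)], std


# Hypothetical setting: 4 embedded features, 2 measured outputs, 24 weights.
params, std = quantum_glorot_init(n_params=24, fan_in=4, fan_out=2)
print(round(std, 4))
```

Compared to uniform sampling from $[-\pi, \pi]$, this concentrates the initial rotation angles around zero, with a width set by the input and output dimensions.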

Analysis of Fisher Information Spectrum. The paper analyzes the spectrum of the Fisher information matrix (FIM), which serves as a tool to quantify the trainability of a model. The empirical FIM is computed as $F(\theta) = \frac{1}{T} \sum_{t=1}^{T} \nabla_{\theta} \log \pi(a_{t}|s_{t},\theta) \, \nabla_{\theta} \log \pi(a_{t}|s_{t},\theta)^{\top}$. A similar analysis has also been proposed for QML [Abb+21].
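Given the log-policy gradients collected along a trajectory, the empirical FIM and its spectrum can be computed as sketched below. The random gradients are placeholders standing in for actual model gradients, and the function name is hypothetical.

```python
import numpy as np


def empirical_fim(log_policy_grads):
    """F = (1/T) * sum_t g_t g_t^T for gradients g_t stacked as rows."""
    grads = np.asarray(log_policy_grads, dtype=float)
    return grads.T @ grads / grads.shape[0]


rng = np.random.default_rng(0)
grads = rng.normal(size=(200, 6))      # placeholder: T = 200 steps, 6 parameters
fim = empirical_fim(grads)
eigenvalues = np.linalg.eigvalsh(fim)  # real spectrum of the symmetric matrix
print(eigenvalues.mean())
```

The matrix is symmetric and positive semi-definite by construction, so its eigenvalues are real and non-negative.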

The results show that the spectrum of the FIM associated with the quantum model exhibits significantly larger eigenvalues on average. The NN used for comparison was optimized over several architectures, but not many details are provided in the paper. The authors conclude that the quantum models are beneficial in terms of trainability and might be resilient to barren plateaus.

Experimental Results and Discussion of Potential Quantum Advantage. The proposed algorithm is tested on the classical benchmark environments CartPole and Acrobot. The performance is compared to the best classical NN (it is not clear what best means in this case, and to what extent this holds). The authors claim a significant advantage in terms of convergence speed.

Additional experiments are conducted with the proposed Quantum-Glorot initialization technique. In the two environments CartPole and Acrobot, this technique proves beneficial in terms of convergence speed and training stability.

Finally, the experiments are extended to a QuantumControl environment. It requires learning the mapping $|0\rangle \to |1\rangle$ via the time-dependent Hamiltonian $H(t) = 4J(t)\sigma_{z} + h\sigma_{x}$. This is converted to a set of unitary gates $U(t)$, such that $|\psi_{t+1}\rangle = U(t)|\psi_{t}\rangle$. The reward is defined as the overlap between the prepared state and $|1\rangle$, i.e., $r_{t} = |\langle \psi_{t} | 1 \rangle|^{2}$. The agent has to decide between the two actions $0 \mathrel{\hat{=}} \text{no pulse}$ and $1 \mathrel{\hat{=}} \text{apply pulse}$. The usage of a quantum environment removes the necessity of encoding classical states. Unfortunately, it is not described how $|\psi_{t}\rangle$ is incorporated in the VQC (a $1$-qubit parameterized circuit is apparently used to solve the task). The results on this environment suggest that the agent is able to learn the optimal pulses in a low number of epochs.
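The dynamics of such a single-qubit control environment can be sketched in NumPy. The paper's concrete settings are not restated here, so the following are assumptions: the pulse amplitude, the field strength $h$, the step size, the convention $\hbar = 1$, and the mapping of the action "no pulse" to $J = 0$.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def step_unitary(J, h, dt):
    """U = exp(-i * H * dt) for H = 4 J sigma_z + h sigma_x (hbar = 1)."""
    H = 4.0 * J * SIGMA_Z + h * SIGMA_X
    energies, vectors = np.linalg.eigh(H)
    return vectors @ np.diag(np.exp(-1j * energies * dt)) @ vectors.conj().T


def reward(psi):
    """Squared overlap of the current state with the target state |1>."""
    return float(np.abs(psi[1]) ** 2)


# Hypothetical settings: pulse amplitude, field strength, and step size.
J_PULSE, H_FIELD, DT = 1.0, 1.0, 0.1
psi = np.array([1.0, 0.0], dtype=complex)   # start in |0>
for action in (0, 0, 1, 0):                 # 0 = no pulse, 1 = apply pulse
    J = J_PULSE if action == 1 else 0.0
    psi = step_unitary(J, H_FIELD, DT) @ psi
    print(action, round(reward(psi), 4))
```

An agent interacting with this environment would choose the action sequence; here a fixed sequence is used only to show the state update and the reward computation.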

Summarizing the experiments, the authors claim an advantage in convergence speed compared to classical approaches (questionable, as there should be NNs which perform much better). Additionally, there seems to be a clear advantage in terms of parameter complexity.

Remarks. The authors claim that it is possible to estimate the log-policy gradient with only a logarithmic number of samples (in the number of variational parameters). While this certainly holds for simulation, it is not clear whether such a technique can be applied on quantum hardware (e.g., some kind of sparse or perturbed gradients). The introduced initialization strategy gives some good experimental results, although additional experiments and theoretical justifications would be desirable. The formulation of the empirical FIM drops the dependency on the prior state distribution, which potentially renders the considered spectrum less representative of the model than for a generic supervised learning problem. The claim of quantum advantage w.r.t. parameter complexity and absence of barren plateaus should be supported with experiments on larger-scale environments.

Quantum-Accessible Environments. An explicit analysis of quantum-accessible environments is conducted in Jerbi et al. [Jer+23]. One instance of such an environment is considered in [SSB23], but also [Wu+23] uses a related formulation. The paper derives explicit quadratic advantages in sampling complexity, if the learned policy satisfies certain regularity conditions. We consider this to be a very important step toward identifying the actual potential of QRL. Interestingly, the stated results suggest that most of the scenarios studied in literature actually satisfy the smoothness conditions. An open problem is the identification of practically relevant problems that can be formulated in the described quantum-accessible setting.

Table 11: Algorithmic Characteristics - Sequeira et al. [SSB23]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates |
|---|---|---|---|---|---|---|
| CartPole (OpenAI Gym) | REINFORCE | Policy | continuous, 4-dim | discrete, 2 | 4 | 4 (encoding), 24 (weights) |
| Acrobot (OpenAI Gym) | REINFORCE | Policy | continuous, 6-dim | discrete, 3 | 6 | 6 (encoding), 36 (weights) |
| QuantumControl (see [SSB23]) | REINFORCE | Policy, Environment | quantum | discrete, 2 | N/A | 0 (encoding)¹, N/A (weights) |

¹ the RL state is a quantum state, i.e. no classical information has to be encoded.

Quantum Policy Gradient Algorithm with Optimized Action Decoding, Meyer et al. (2023)


Summary. The work by Meyer et al. [Mey+23a] builds upon the QPG framework introduced in [Jer+21]. It takes a closer look at the introduced RAW-VQC policy and – based on measurements in the computational basis – introduces a classical post-processing function for action selection. By optimizing this function w.r.t. a novel quality measure, significant performance improvements can be made. The introduced procedure is also suited for problems with large action spaces. Experiments on a 5-qubit quantum device represent the first successful training of a VQC-based RL routine on actual quantum hardware.

Classical Post-Processing. The work focuses on the RAW-VQC policy, i.e. $\pi_{\boldsymbol{\theta}}(a|\boldsymbol{s})=\langle P_{a}\rangle_{\boldsymbol{s},\boldsymbol{\theta}}$. For measurements in the computational basis, this can be viewed as a partitioning of all possible bitstrings $\mathcal{C}$. This allows the definition of a classical post-processing function $f_{\mathcal{C}}:\{0,1\}^{n}\to\{0,1,\cdots,|\mathcal{A}|-1\}$, such that $f_{\mathcal{C}}(\boldsymbol{b})=a$ iff $\boldsymbol{b}\in\mathcal{C}_{a}$. The policy can therefore be expressed as $\pi_{\boldsymbol{\theta}}(a|\boldsymbol{s})\approx\frac{1}{K}\cdot\sum_{k=0}^{K-1}\delta_{f_{\mathcal{C}}(\boldsymbol{b}^{(k)})=a}$, where $\boldsymbol{b}^{(k)}$ is the bitstring observed in the $k$-th shot.
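
The shot-based estimate of the policy can be written down in a few lines. The sketch below assumes that measurement outcomes are available as tuples of bits; the decoding function `first_bit_decoding` and the sample shots are purely illustrative and not taken from [Mey+23a].

```python
from collections import Counter


def first_bit_decoding(bitstring):
    """Hypothetical post-processing function f_C: the action is the first bit."""
    return bitstring[0]


def estimate_policy(shots, decode, num_actions):
    """Estimate pi(a|s) as the fraction of shots whose bitstring decodes to a."""
    counts = Counter(decode(b) for b in shots)
    total = len(shots)
    return [counts[a] / total for a in range(num_actions)]


# Illustrative measurement record of K = 4 shots on n = 3 qubits.
shots = [(0, 1, 1), (1, 0, 0), (0, 0, 1), (0, 1, 0)]
policy = estimate_policy(shots, first_bit_decoding, num_actions=2)
```

Replacing `first_bit_decoding` by another partitioning of the bitstrings changes the policy without touching the quantum circuit, which is the degree of freedom exploited in this work.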

Globality Measure. The formulation in terms of a classical post-processing function allows for the definition of a quality measure on the explicitly used partitioning of $\mathcal{C}$. The authors start out with the extracted information $\text{EI}_{f_{\mathcal{C}}}(\boldsymbol{b})$, which denotes the number of bits necessary to get an unambiguous assignment of the bitstring $\boldsymbol{b}$ to the set $\mathcal{C}_{a}$ it is contained in. This is extended to a globality measure by averaging over all possible bitstrings, i.e. $G_{f_{\mathcal{C}}}:=\frac{1}{2^{n}}\sum_{\boldsymbol{b}\in\{0,1\}^{n}}\text{EI}_{f_{\mathcal{C}}}(\boldsymbol{b})$. This measure quantifies how much information is used on average to make a decision for an action. While this measure is hard to compute in general, the authors discuss an explicit construction of a post-processing function that guarantees saturating the globality measure (which is trivially upper-bounded by the number of involved qubits). Based on that construction, an optimal post-processing function is given by $f_{\mathcal{C}}(\boldsymbol{b})=\left[b_{0}\cdots b_{m-1}\left(\bigoplus_{i=m}^{n-1}b_{i}\right)\right]_{10}$, where $\left[\cdot\right]_{10}$ refers to the decimal representation and $m=\log_{2}(|\mathcal{A}|)-1$.
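
For small qubit numbers, the globality measure can be evaluated by brute force. The sketch below interprets the extracted information of a bitstring as the smallest number of bit positions that must be known so that every completion of the remaining bits leads to the same action; this reading of the definition, as well as all function names, is an assumption made for illustration. The action count is assumed to be a power of two, and $b_{0}$ is treated as the most significant bit.

```python
from itertools import combinations, product


def optimal_decoding(bits, num_actions):
    """f_C(b) = [b_0 ... b_{m-1} (XOR of b_m ... b_{n-1})]_10, m = log2|A| - 1."""
    if num_actions < 2 or num_actions & (num_actions - 1):
        raise ValueError("num_actions must be a power of two")
    m = num_actions.bit_length() - 2  # equals log2(num_actions) - 1
    parity = 0
    for bit in bits[m:]:
        parity ^= bit
    label = list(bits[:m]) + [parity]
    return int("".join(str(bit) for bit in label), 2)


def extracted_information(decode, bits):
    """Smallest number of known bit positions that fixes the action of `bits`."""
    n = len(bits)
    target = decode(bits)
    for k in range(n + 1):
        for known in combinations(range(n), k):
            free = [i for i in range(n) if i not in known]
            unambiguous = True
            for values in product((0, 1), repeat=len(free)):
                candidate = list(bits)
                for position, value in zip(free, values):
                    candidate[position] = value
                if decode(tuple(candidate)) != target:
                    unambiguous = False
                    break
            if unambiguous:
                return k
    return n


def globality(decode, n):
    """Average extracted information over all 2^n bitstrings."""
    total = sum(extracted_information(decode, b)
                for b in product((0, 1), repeat=n))
    return total / 2 ** n
```

With this reading, a decoding that only looks at the first of 4 bits has globality 1.0, while the parity-based construction with two actions reaches the upper bound 4.0; the cost of the brute-force evaluation grows exponentially in the number of qubits.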

Experimental Results. The claim that a high value of the globality measure correlates with good RL performance is experimentally demonstrated on several environments. Experiments on the CartPole benchmark with globality values ranging from $G_{f_{\mathcal{C}}}=1.0$ to the maximum possible $G_{f_{\mathcal{C}}}=4.0$ show a clear correlation between the measure and the actual performance of the resulting algorithm. It is noted that the construction of the post-processing function is explicitly detached from the complexity of the actual quantum model, and is therefore a very efficient way to improve the performance. The QRL agents with $G_{f_{\mathcal{C}}}>3.0$ also outperform the SOFTMAX-VQC policy, which was originally conjectured to be superior in [Jer+21]. These results are strengthened by experiments on FrozenLake and ContextualBandits environments. Empirical results regarding effective dimension and the Fisher information spectrum [Abb+21] also demonstrate an improved expressivity and trainability of models with high globality measure.

Training on Quantum Hardware. Using this enhanced QPG algorithm, the authors execute a full training routine on an 8-state and 2-action ContextualBandits environment on quantum hardware. They employ a 3-qubit sub-topology of the 5-qubit IBM quantum device ibmq_manila [IBM23]. The results confirm that training VQC-based QRL algorithms on actual hardware is indeed possible. However, there is still a deterioration of performance compared to the noise-free simulation, which is explained by the currently inevitable hardware noise. Verification of the learned parameters demonstrates that the agent actually identifies the optimal action in all cases; only the certainty of that decision is less pronounced compared to simulation.

Remarks. The described action decoding procedure is easy to extend to problems with large action spaces. However, some additional engineering is necessary to account for action spaces whose size cannot be expressed as a power of two. It is left open at which point the benefit of using a post-processing function with high globality is outweighed by the likely occurrence of barren plateaus [Cer+21]. Potentially, the flexible definition of the post-processing function can be used to balance those two objectives. While the demonstration of trainability on quantum hardware is certainly small-scale, it can be considered an important step towards the practical usability of this type of algorithm.

Table 12: Algorithmic Characteristics - Meyer et al. [Mey+23a]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
|---|---|---|---|---|---|---|
| CartPole (OpenAI Gym) | REINFORCE | Policy | continuous, 4-dim | discrete, 2 | 4 | 24 to 40² |
| FrozenLake (OpenAI Gym) | REINFORCE | Policy | discrete, 16 | discrete, 4 | 4 | 24 to 40 |
| ContextualBandits (see [SB18]) | REINFORCE | Policy | discrete, 32 | discrete, 8 | 5 | 70 |
| ContextualBandits (see [SB18])³ | REINFORCE | Policy | discrete, 8 | discrete, 2 | 3 | 30 |

¹ this entails encoding, scaling, and variational parameters.
² the SOFTMAX-VQC also uses additional classical parameters.
³ hardware experiment: modified circuit structure to reduce transpilation overhead, details in [Mey+23a].

Quantum Natural Policy Gradients: Towards Sample-Efficient Reinforcement Learning, Meyer et al. (2023)


Summary. The paper by Meyer et al. [Mey+23] proposes an enhanced training routine for the framework proposed in [Jer+21] and extended in [Mey+23a]. A second-order extension – based on so-called quantum natural gradients – is employed to define the quantum natural policy gradient (QNPG) algorithm. The modified technique is experimentally demonstrated to have preferable properties regarding trainability, and is also verified on actual quantum hardware.

Natural Gradients. The original QPG algorithm is trained based on first-order updates, i.e. $\Delta\boldsymbol{\theta}=\alpha\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})$. This update structure has the shortcoming that it is closely tied to the Euclidean geometry and does not take into account the actual curvature of the loss landscape. This can be mitigated by using the FIM $F(\boldsymbol{\theta})$, which describes the local curvature of the parameter space around a given point. It can be used to define a natural gradient update as $\Delta\boldsymbol{\theta}=\alpha F^{-1}(\boldsymbol{\theta})\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})$ [Ama98].

Figure 10: QNPG pipeline proposed by Meyer et al. [Mey+23]; The pseudoinverse of the quantum FIM is used to perform training in an undistorted neighborhood of the loss landscape.

Quantum Natural Policy Gradients. In order to employ this concept for training in the quantum realm, the paper employs a generalization of the classical FIM. This quantum FIM $g(\boldsymbol{\theta})$ (derived from the Fubini-Study metric tensor [Che10]) is hard to compute in general – however, a block-diagonal approximation can be estimated efficiently in hardware [Sto+20]. Based on that, the paper defines the QNPG update rule as $\Delta\boldsymbol{\theta}=\alpha g^{\dagger}(\boldsymbol{\theta})\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})$, where $g^{\dagger}$ denotes the pseudoinverse of the quantum FIM (cf. Fig. 10). Additionally, a regularized version of the QNPG algorithm is introduced to counter instabilities encountered when inverting the quantum FIM. It has to be highlighted that the overhead of incorporating this second-order update rule is almost negligible compared to the computation of first-order gradients, which is necessary anyway. The pipeline of the overall algorithm is visualized in Fig. 10.
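
The update rule can be made concrete on a single-qubit toy example. The sketch below assumes the state $\cos(\theta/2)\ket{0}+e^{i\varphi}\sin(\theta/2)\ket{1}$, estimates the full Fubini-Study metric by finite differences of the state vector (not the block-diagonal hardware estimate of [Sto+20]), and adds a Tikhonov-style term `reg` as one possible form of regularization. The function names and the specific regularization are illustrative assumptions, not the implementation of [Mey+23].

```python
import numpy as np


def state(params):
    """Toy single-qubit state cos(t/2)|0> + exp(i*p) sin(t/2)|1>."""
    theta, phi = params
    return np.array([np.cos(theta / 2),
                     np.exp(1j * phi) * np.sin(theta / 2)])


def fubini_study_metric(state_fn, params, eps=1e-6):
    """g_ij = Re(<d_i psi|d_j psi> - <d_i psi|psi><psi|d_j psi>)."""
    params = np.asarray(params, dtype=float)
    psi = state_fn(params)
    derivatives = []
    for i in range(len(params)):
        shift = np.zeros_like(params)
        shift[i] = eps
        # Central finite difference of the state vector.
        derivatives.append(
            (state_fn(params + shift) - state_fn(params - shift)) / (2 * eps))
    n = len(params)
    metric = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            term = (np.vdot(derivatives[i], derivatives[j])
                    - np.vdot(derivatives[i], psi) * np.vdot(psi, derivatives[j]))
            metric[i, j] = term.real
    return metric


def qnpg_step(gradient, metric, lr=0.1, reg=0.0):
    """Delta theta = lr * pinv(g + reg * I) @ gradient (reg is an assumption)."""
    regularized = metric + reg * np.eye(len(gradient))
    return lr * np.linalg.pinv(regularized) @ gradient
```

For this state the metric is known analytically as $\mathrm{diag}(1/4,\ \sin^{2}(\theta)/4)$, which becomes singular at $\theta=0$; this is the kind of situation where the pseudoinverse or a regularization term is needed.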

Experimental Results. The effectiveness of the training routine is demonstrated on different instances of ContextualBandits. On a small-scale setting with only a single qubit and two trainable parameters, it is shown that the (regularized) QNPG algorithm converges significantly faster for random initializations compared to the original QPG formulation. For specific initializations it is moreover validated that the second-order extension does what it was designed for and helps to traverse distorted regions of the loss landscape. An up-scaled experiment with a 12-qubit VQC underlines the efficiency of the introduced routine.

Training on Quantum Hardware. To demonstrate the practical feasibility of the QPG approach, the authors train a medium-scale instance on actual quantum hardware. The experiment employs a 12-qubit sub-topology of the 27-qubit system ibmq_ehningen [IBM23]. The results demonstrate that the quantum agent is actually able to learn meaningful behavior in the 4096-state ContextualBandits environment. There is some deterioration of the performance compared to noise-free simulation, which is not caused by the training routine itself, as demonstrated by experiments with analytically optimal parameters. However, the learned policy identifies the correct action in a majority of the cases, similar to the hardware results in [Mey+23a].

Remarks. The paper demonstrates the effectiveness of the QNPG routine for ContextualBandits environments; the extension to more generic problems is, however, left for future work. A very interesting consideration is the influence of quantum natural gradients on the barren plateau problem, which is discussed with different results in Refs. [HK21, Tha+23]. The hardware experiment using 12 qubits is a big improvement upon the results in [Mey+23a] and can be considered the currently largest-scale practical demonstration of VQC-based QRL.

Table 13: Algorithmic Characteristics - Meyer et al. [Mey+23]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates |
|---|---|---|---|---|---|---|
| ContextualBandits (see [SB18]) | QNPG | Policy | discrete, 2 | discrete, 2 | 1 | 1 (encoding), 2 (weights) |
| ContextualBandits (see [SB18])¹ | QNPG | Policy | discrete, 4096 | discrete, 2 | 12 | 12 (encoding), 36 (weights) |

¹ hardware experiment: hardware-native circuit structure, details in [Mey+23].

4.2.3 Combined Approximations

It is possible to combine the approach of approximation in value space from Sec. 4.2.1 and in policy space from Sec. 4.2.2. This is formulated in an actor-critic approach in Wu et al. [Wu+23], which is re-implemented and extended in Refs. [Kwa+21, Ree23]. An asynchronous training routine is proposed by S. Y.-C. Chen [Che23]. A soft actor-critic formulation is described by Q. Lan [Lan21]. An extension to multiple agents is proposed in Yun et al. [Yun+22] and extended in Ref. [YPK23].

An overview of progress in the field of quantum multi-agent RL can be found in Ref. [ZY23].

| Citation | First Author | Title |
|---|---|---|
| [Wu+23] | S. Wu | Quantum reinforcement learning in continuous action space |
| [Kwa+21] | Y. Kwak | Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation |
| [Ree23] | V. Reers | Towards Performance Benchmarking for Quantum Reinforcement Learning |
| [Che23] | S. Y.-C. Chen | Asynchronous training of quantum reinforcement learning |
| [Lan21] | Q. Lan | Variational Quantum Soft Actor-Critic |
| [Yun+22] | W. J. Yun | Quantum Multi-Agent Reinforcement Learning via Variational Quantum Circuit Design |
| [YPK23] | W. J. Yun | Quantum Multi-Agent Meta Reinforcement Learning |

Table 14: Work considered for “QRL with VQCs – Combined Approximations” (Sec. 4.2.3)

Quantum reinforcement learning in continuous action space, Wu et al. (2023)


Summary. This paper by Wu et al. [Wu+23] extends the concept of VQC-based RL to continuous action spaces. The authors choose a quantum control environment, more concretely one that encodes an eigenvalue problem. This allows the action to be interpreted as a (parameterized) unitary. The experimental results suggest an exponential reduction in model complexity compared to classical approaches.

Eigenvalue Problem as RL Environment. The RL agent has to solve an eigenvalue problem, i.e., find the eigenvalue of a given Hamiltonian. This should be done in the following iterative procedure: Let $H$ be the Hamiltonian of an $n$-qubit quantum system $E$ and $\ket{s_{0}}$ an initial state from $E$. The system should be driven towards the eigenstate of $H$, denoted as $\ket{u_{0}}$. The corresponding eigenvalue $\lambda_{0}$ should also be returned. Although not explicitly stated in the paper, we assume the agent should search for the eigenstate with the associated smallest eigenvalue, as this corresponds to the ground state.

The observation for this environment is the current quantum state $\ket{s_{t}}$, which is provided to the agent via some quantum channel. The actions the agent can execute correspond to parameterized unitaries $U(\theta_{t})$, where $\theta_{t}$ are classical parameters sampled from the VQC via measurements. Once instantiated, this unitary is applied to $\ket{s_{t}}$ to evolve the state $U(\theta_{t})\ket{s_{t}}=\ket{s_{t+1}}$. The agent receives a classical reward, which describes the closeness of the current state to the searched eigenstate of the Hamiltonian.

The authors state that their proposed technique has some parallels to Grover’s search. More concretely, the trained agent provides an alternative to the amplitude amplification procedure, which could otherwise be used to solve the task at hand.

Model Architecture and Underlying RL Algorithm. The overall approach can be considered hybrid, as the optimization of the VQC parameters is still conducted on classical hardware. A schematic description of the approach is given in Fig. 11. The agent observes a quantum state from the environment, which is used as the initial state $\ket{s_{t}}$ of the VQC function approximator. Measurements on the prepared quantum state determine the parameters $\theta_{t}$. Those are then fed into the unitary operator $U(\theta_{t})$, which is applied to the environment state. The new state $\ket{s_{t+1}}$, combined with an ancilla reward qubit initialized to $\ket{0}$, is then evolved using some user-defined reward unitary $U_{r}$. Measurements are performed on this state to determine the reward produced by the executed action. This procedure is repeated for several timesteps, with the objective to approximate the eigenstate $\ket{u_{0}}$.

(a) The QRL model. Each iterative step can be described by the following loop: (1) at step $t$, the agent receives $\ket{s_{t}}$ and generates the action parameter $\boldsymbol{\theta}_{t}$ according to the current policy; (2) the agent generates $\ket{s_{t+1}}=U(\boldsymbol{\theta}_{t})\ket{s_{t}}$; (3) based on $\ket{s_{t}}$ and $\ket{s_{t+1}}$, a reward $r_{t+1}$ is calculated and fed back to the agent, together with $\ket{s_{t+1}}$; (4) based on $s_{t+1}$ and $r_{t+1}$, the policy is updated and then used to generate $\boldsymbol{\theta}_{t+1}$.
(b) The quantum circuit for our QRL framework at each iteration. The entire QRL process includes two stages, so we give the circuit separately. In stage 1, the circuit includes two registers: the reward register, initialized $\ket{0}$, and the environment register $\ket{s_{t}}$. $U_{policy}$ is generated by the quantum neural network, and determines the action unitary $U(\boldsymbol{\theta}_{t})$. $U_{r}$ and $M$ are designed to generate the reward $r_{t+1}$. In stage 2, the circuit has only environment register and does not need to feedback the reward value and update the policy.
Figure 11: Hybrid model for quantum environment proposed by and taken from Wu et al. [Wu+23] (including subcaptions); We note an ambiguity in notation, as the parameters $\theta_{t,i}$ must not be confused with the parameters of the action unitary $U(\theta)$. The first set are the ones updated by the RL algorithm, the other ones are extracted via measurements from the quantum state prepared by the VQC.

The VQC architecture does not incorporate a feature map, as the observation $\ket{s_{t}}$ is used as the initial state $\ket{\Phi}$. Each parameterized layer consists of 1-qubit rotations and a circular entanglement structure. For every element of the action parameters $\theta_{j}$, there is an associated observable $B_{j}$, which is measured on the prepared quantum state. (The paper does not mention how the action unitary $U(\theta)$ is explicitly constructed.) Following this step, a phase estimation circuit implements the reward unitary $U_{r}=U_{PE}$. This transforms the state to the basis of eigenstates, i.e., $U_{PE}\ket{0}\ket{s_{t+1}}=\sum_{k=1}^{n}\alpha_{t+1,k}\ket{\lambda_{k}}\ket{u_{k}}$. With a measurement of the eigenvalue phase register, the desired eigenvalue $\lambda_{0}$ is observed with a probability of $p_{t+1}=|\alpha_{t+1,0}|^{2}=|\braket{s_{t+1}}{u_{0}}|^{2}$. The reward can then be defined as e.g. $r_{t+1}=p_{t+1}-p_{t}$. Obviously, for $p_{t+1}\to 1$, the state $\ket{s_{t+1}}$ converges to $\ket{u_{0}}$.

The underlying RL routine is an actor-critic method. Therefore, the paper combines a policy-VQC as actor and a $Q$-function-VQC as critic to a so-called quantum deep deterministic policy gradient (QDDPG) algorithm. The experience of the agent, i.e., tuples $(\ket{s_{t}},\theta_{t},r_{t},\ket{s_{t+1}})$, is stored in a replay buffer to prevent overfitting. Additionally, target networks are employed for both the actor and the critic.
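
The two classical ingredients mentioned above can be sketched independently of the quantum part. The following minimal example assumes that quantum states are stored as classical objects (e.g., simulated state vectors) and that target parameters follow the online parameters via the soft update commonly used in DDPG; [Wu+23] is not quoted here on either point, and the class and function names are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # Oldest transitions are dropped once the capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a uniformly random mini-batch without replacement."""
        size = min(batch_size, len(self.storage))
        return random.sample(list(self.storage), size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, tau=0.01):
    """Assumed DDPG-style update: target <- (1 - tau) * target + tau * online."""
    return [(1 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]
```

Sampling uniformly from past transitions decorrelates consecutive updates, while the slowly moving target parameters stabilize the bootstrapped critic targets.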

Experimental Results and Model Complexity. All experimental results in the paper are based on classical simulations. For training, the Hamiltonian $H=\frac{1}{4}(s_{x}\sigma_{x}+s_{y}\sigma_{y}+s_{z}\sigma_{z}+I)$ is instantiated with the coefficients $(s_{x},s_{y},s_{z})=(0.13,0.28,0.95)$. Concrete details on the training procedure, e.g., the number of episodes, are not stated. The trained model is applied to 1000 random initial states. The overlap with the respective $\ket{u_{0}}$ approaches one; consequently, the agent is able to get quite close to the desired eigenstate in all cases. The trained model shows good generalization capabilities, i.e., it can be applied to various initial states. This is in contrast to e.g. a variational quantum eigensolver (VQE), where the control pulse for one initial state is meaningless for other ones.
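
The reward signal for this training Hamiltonian can be reproduced numerically. The sketch below replaces the phase estimation circuit by an exact classical diagonalization and, following the assumption made in this survey, takes $\ket{u_{0}}$ to be the eigenstate with the smallest eigenvalue; the function names are illustrative.

```python
import numpy as np

IDENTITY = np.eye(2, dtype=complex)
SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def hamiltonian(sx, sy, sz):
    """H = (sx*sigma_x + sy*sigma_y + sz*sigma_z + I) / 4."""
    return 0.25 * (sx * SIGMA_X + sy * SIGMA_Y + sz * SIGMA_Z + IDENTITY)


def ground_state(h):
    """Smallest eigenvalue and its eigenvector (eigh sorts ascending)."""
    eigenvalues, eigenvectors = np.linalg.eigh(h)
    return eigenvalues[0], eigenvectors[:, 0]


def success_probability(state, u0):
    """p = |<s|u0>|^2, the probability of reading out lambda_0."""
    return abs(np.vdot(u0, state)) ** 2


def reward(p_next, p_current):
    """r_{t+1} = p_{t+1} - p_t."""
    return p_next - p_current


H = hamiltonian(0.13, 0.28, 0.95)
lambda0, u0 = ground_state(H)
```

For a single qubit, the eigenvalues are $(1\pm\lVert s\rVert)/4$, so `lambda0` is close to zero for the coefficients above, whose norm is close to one.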

The overall gate complexity for one RL episode is stated as $\mathcal{O}(m\cdot\mathrm{polylog}(N))$. Here, $m$ is the number of shots for sampling expectation values and $N$ denotes the number of qubits. This statement assumes that $H$ can be efficiently simulated, as otherwise the complexity of $U_{PE}$ would exceed $\mathcal{O}(\mathrm{polylog}(N))$. Additionally, all VQCs in the method are also assumed to have a gate complexity of at most $\mathcal{O}(\mathrm{polylog}(N))$. With these prerequisites, the authors claim an exponential advantage in model complexity compared to classical approaches.

Generalization to Discrete Action Spaces. The paper also generalizes the presented concept to discrete action spaces, with the FrozenLake environment as an example. The observations are encoded as basis states into the VQC via computational encoding, similar to Chen et al. [Che+20]. The movements applied by the actions are formulated as unitaries acting on the VQC state. A slight generalization of Chen et al. [Che+20] is used for this, which allows performing the transforms $\ket{0}\to\ket{1}$ and $\ket{1}\to\ket{0}$ in a parameterized manner. The reward unitary is formulated in a similar fashion. It is stated that experiments with this configuration were successful, but no concrete results are provided.

Remarks. There are some caveats and ambiguities we identified regarding the proposed approach. First, the algorithm requires knowledge of, and the ability to prepare, the desired eigenstate $\ket{u_{0}}$ for the training procedure. With this state already known, the whole procedure of reproducing it is a somewhat circular task. However, as the learned model seems to generalize to different input states, the technique offers a clear advantage over approaches like quantum phase estimation. Second, the model requires repeated preparation of the environment state $\ket{s_{t}}$, as it is disturbed by measurements to extract the reward information. This should be doable, as one knows the state preparation routine $\ket{s_{t}}=U(\theta_{t-1})\cdots U(\theta_{0})\ket{s_{0}}$. The influence of this additional overhead is unfortunately not included in the complexity considerations discussed above. Third, the claim of exponential quantum advantage w.r.t. model complexity (i.e. $\mathcal{O}(\mathrm{polylog}(N))$ for all VQCs) should be supported by larger-scale experiments.

Table 15: Algorithmic Characteristics - Wu et al. [Wu+23]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates |
| --- | --- | --- | --- | --- | --- | --- |
| Quantum Eigenvalues (see [Wu+23]) | Actor-Critic QDDPG | $Q$-function, Policy, Environment | quantum | continuous¹ | $n$ | $0$ (encoding)²; $n\times d\times 3$ (weights)³ |
| FrozenLake (OpenAI Gym) | Actor-Critic QDDPG | $Q$-function, Policy | discrete⁴, $16$ | discrete⁴, $4$ | $n$ | $0$ (encoding)²; N/A (weights) |

¹ output is interpreted as parameters of a unitary, i.e. a quantum operation applied to the environment;
² the RL state is a quantum state, i.e. no classical information has to be encoded;
³ variational gates: qubits × layers × per_qubit_per_layer; details are not specified;
⁴ state and action space are encoded into the quantum realm for a neat integration into the pipeline;

Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation, Kwak et al. (2021)


Summary. The paper by Kwak et al. [Kwa+21] gives a short introduction to both RL and (variational) QC. This is followed up by a tutorial on how to implement a VQC-enhanced RL algorithm with PennyLane to solve the CartPole environment.

Hybrid RL Agent. The paper employs the typical hybrid structure, with the VQC as a function approximator. The optimization of the parameters and the interaction with the CartPole environment is executed on classical hardware. The underlying algorithm uses an actor-critic approach, where the actor is quantum and the critic is classical. A set of $1$-qubit rotations is used to encode the state of the CartPole environment into the four-qubit system. This encoding layer is followed by $4$ layers with learnable $1$-qubit rotations and an unspecified entanglement structure. The result is extracted from the measurement of $2$ qubits in the computational basis and the respective expectation values are interpreted as the action-value function.
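The circuit structure described above can be sketched as a small state-vector simulation in plain Python. This is only an illustration under stated assumptions: the paper implements its circuit with PennyLane and leaves the entanglement structure unspecified, so the $R_x$ encoding, the $R_x$-$R_y$-$R_z$ triple per qubit and layer (chosen to match the $4\times 4\times 3$ weight count), the ring of CZ gates, and all function names are hypothetical choices.

```python
import cmath
import math

N_QUBITS, N_LAYERS = 4, 4  # CartPole: four observations, four variational layers


def rx(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return ((c, -1j * s), (-1j * s, c))


def ry(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return ((c, -s), (s, c))


def rz(t):
    return ((cmath.exp(-0.5j * t), 0.0), (0.0, cmath.exp(0.5j * t)))


def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q (bit q of the basis-state index)."""
    new = list(state)
    bit = 1 << q
    for i in range(len(state)):
        if not (i & bit):
            a0, a1 = state[i], state[i | bit]
            new[i] = gate[0][0] * a0 + gate[0][1] * a1
            new[i | bit] = gate[1][0] * a0 + gate[1][1] * a1
    return new


def apply_cz(state, q1, q2):
    """Flip the sign of every amplitude where both qubits are 1."""
    mask = (1 << q1) | (1 << q2)
    return [-a if (i & mask) == mask else a for i, a in enumerate(state)]


def expval_z(state, q):
    """Expectation value of Pauli-Z on qubit q."""
    return sum((-1 if (i >> q) & 1 else 1) * abs(a) ** 2
               for i, a in enumerate(state))


def vqc_forward(obs, weights):
    """obs: 4 floats; weights[layer][qubit] holds 3 rotation angles."""
    state = [0j] * (2 ** N_QUBITS)
    state[0] = 1 + 0j
    for q in range(N_QUBITS):  # encoding layer (assumed: one Rx per qubit)
        state = apply_1q(state, rx(obs[q]), q)
    for layer in range(N_LAYERS):
        for q in range(N_QUBITS):  # assumed rotation triple Rx, Ry, Rz
            for gate, w in zip((rx, ry, rz), weights[layer][q]):
                state = apply_1q(state, gate(w), q)
        for q in range(N_QUBITS):  # assumed entanglement: ring of CZ gates
            state = apply_cz(state, q, (q + 1) % N_QUBITS)
    # two measured qubits, one expectation value per CartPole action
    return [expval_z(state, 0), expval_z(state, 1)]
```

Each returned value lies in $[-1, 1]$, which is why approaches of this kind often rescale the outputs before interpreting them as action values.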

Remarks. The agent is able to surpass random behavior, but lags behind other hybrid approaches [LS20, Jer+21]. To the best of our understanding, the implemented quantum actor-critic approach deviates in some details from previously considered approaches. Most importantly, a hybrid approach is used, where the actor is represented with a VQC and the critic employs a classical DNN. A benchmark analysis of the described setup is proposed and conducted by V. Reers [Ree23].

Table 16: Algorithmic Characteristics - Kwak et al. [Kwa+21]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| CartPole (OpenAI Gym) | Actor-Critic² | $Q$-function | continuous, $4$-dim | discrete, $2$ | $4$ | $4\times 1$ (encoding); $4\times 4\times 3$ (weights) |

¹ encoding gates: qubits × per_qubit; variational gates: qubits × layers × per_qubit_per_layer;
² only the actor employs a VQC, the critic uses a classical DNN;

Asynchronous training of quantum reinforcement learning, S. Y.-C. Chen (2023)


Summary. This work by S. Y.-C. Chen [Che23] introduces an actor-critic approach that can be trained in an asynchronous fashion. The main advantage is that training can be spread out over several classical simulators or quantum hardware devices. The efficiency of the introduced quantum asynchronous advantage actor critic (QA3C) algorithm compared to previous formulations is demonstrated on several benchmark environments.

Quantum A3C. The underlying concept is based on the classical A3C algorithm [Mni+16]. This framework makes use of a global shared memory and a process-specific memory for each individual agent. Each agent interacts with the environment independently, and the global model is updated with the information provided by the local agents only once certain criteria are met. This enables a distributed and therefore easily parallelizable training routine. The approximators for the $Q$-function and the policy are both realized using VQCs with classical neural networks pre- and appended to form a hybrid model.

Experimental Results. The proposed QA3C algorithm is executed on the environments Acrobot, CartPole, and MiniGrid-SimpleCrossing. Across all instances, the hybrid quantum model is observed to be competitive with a much larger classical model. Moreover, QA3C is shown to outperform classical A3C employing classical models of comparable complexity.

Remarks. Distributing the training among several workers is certainly an important consideration, given the current access modalities of quantum hardware providers. However, it is not clear whether training can be distributed in practice, considering the long queue waiting times. Moreover, the role of the VQC itself remains unclear due to the pre- and appended neural networks. Nevertheless, the comparison to fully classical agents of similar size is an interesting consideration. As usual, it has to be highlighted that the experiments were too small-scale to make meaningful statements on potential quantum advantage.

Table 17: Algorithmic Characteristics - S. Y.-C. Chen [Che23]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| Acrobot (OpenAI Gym) | Actor-critic QA3C | $Q$-function, Policy | continuous³, $6$-dim | discrete³, $3$ | $8$ | N/A (encoding); $48$ (weights)²; $148$ (classical) |
| CartPole (OpenAI Gym) | Actor-critic QA3C | $Q$-function, Policy | continuous³, $4$-dim | discrete³, $2$ | $8$ | N/A (encoding); $48$ (weights)²; $107$ (classical) |
| SimpleCrossing (OpenAI Gym) | Actor-critic QA3C | $Q$-function, Policy | continuous³, $127$-dim | discrete³, $6$ | $8$ | N/A (encoding); $48$ (weights)²; $2431$ (classical) |

¹ the training process is distributed over $80$ workers, each of which holds a local copy of the parameters;
² actor and critic are each composed of an individual hybrid model, i.e. the number of weights is doubled;
³ action and state spaces are mapped to the required dimensionality by using classical neural networks;

Variational Quantum Soft Actor-Critic, Q. Lan (2021)


Summary. The paper by Q. Lan [Lan21] introduces a quantum version of a soft actor-critic (SAC) approach. The advantage of this algorithm, compared to previous suggestions, is the possibility to work with a continuous action space. The algorithm is tested on the Pendulum environment.

Soft Actor-Critic for Continuous Control. The term continuous control refers to a setup in which the agent acts in a continuous action space. Most publications in the context of QRL deal with discrete action spaces, while a few others discuss continuous control for quantum environments [Wu+23, SSB23]. This work focuses on classical environments, which requires some kind of action decoding based on measurements of the quantum state. Instead of directly selecting the actions based on measurement results, the parameterized hybrid model learns the parameters of a distribution, from which the action is sampled. The VQC and a downstream NN are used to represent mean $\mu$ and variance $\sigma$ of a Gaussian distribution. This allows the agent to act in a continuous action space in a straightforward manner.
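The action decoding step can be illustrated with a minimal sketch: given a mean and a variance (which in the paper would come from the VQC and the downstream NN), a continuous action is sampled from the corresponding Gaussian. The `tanh` squashing to a bounded torque range is the common SAC convention and an assumption here, as are the function name and the default bound of $2.0$ for the Pendulum torque; the paper's exact decoding may differ.

```python
import math
import random


def sample_action(mean, variance, max_torque=2.0, rng=random):
    """Sample a bounded continuous action from N(mean, variance).

    mean and variance stand in for the outputs of the hybrid model;
    max_torque is an assumed action bound for the Pendulum environment.
    """
    raw = rng.gauss(mean, math.sqrt(variance))  # unbounded Gaussian sample
    return max_torque * math.tanh(raw)          # squash into [-max_torque, max_torque]
```

A larger predicted variance spreads the sampled actions more widely, which is the mechanism through which the entropy term of SAC encourages exploration.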

In contrast to the standard RL setup, SAC [Haa+18] not only aims to optimize the expected return, but also the policy entropy [Zie+08, Haa+17]. Therefore, the expected return is defined as $G_{t}=\sum_{i=t}^{\infty}\gamma^{i-t}\left(r(s_{i},a_{i})+\alpha\mathcal{H}[\pi_{\theta}(\cdot|s_{i})]\right)$, where the coefficient $\alpha$ weights the entropy term and $\mathcal{H}[p]=-\int_{\mathbb{R}}p(x)\log p(x)\,\mathrm{d}x$ is the differential entropy for the probability density function $p(x)$. Among other advantages, this entropy regularization potentially enhances exploration by encouraging more stochastic policies [Haa+17].
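As a short worked example of the entropy term, consider a one-dimensional Gaussian policy. Writing its variance as $v$ (a symbol introduced here only for illustration), the differential entropy defined above has the standard closed form:

```latex
\mathcal{H}\left[\mathcal{N}(\mu, v)\right]
  = -\int_{\mathbb{R}} p(x)\log p(x)\,\mathrm{d}x
  = \frac{1}{2}\log\left(2\pi e\, v\right)
```

The entropy is independent of the mean and grows with the variance: for $v=1$ it is approximately $1.42$ nats, and every doubling of $v$ adds $\frac{1}{2}\log 2\approx 0.35$ nats. The bonus $\alpha\mathcal{H}$ in the return therefore rewards wider action distributions.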

VQC Architecture. The paper considers two different VQC architectures. The first one uses the typical three-part structure of rotational encoding, variational layers, and measurements. The second architecture is more complex, as it uses data re-uploading [Pér+20] and a more elaborate encoding structure [SJD22, Jer+21]. It can be expected that the second choice gives rise to more expressive models, which usually correlates with RL performance.

Experimental Results. The experimental section of the paper compares the performance of the two resulting quantum SAC approaches to a classical NN on the Pendulum environment. On the one hand, the quantum model with the simple VQC architecture is inferior to the other two approaches. On the other hand, the quantum model with data re-uploading performs similarly to the classical model, and both are able to learn near-optimal behavior. The quantum model incorporates only $41$ parameters, while the classical one uses $1250$. This is interpreted as a quantum advantage w.r.t. parameter complexity.

Some additional architecture experiments are conducted, mainly focusing on the depth of the underlying VQCs. It is observed that a certain number of variational layers is required to enable training. Overall, the performance is strongly correlated with the concrete architecture choice, which is in line with the results known from literature [Fra+22].

Remarks. To substantiate the claim of quantum advantage w.r.t. parameter complexity, more experiments with increasing environment size should be performed. Since NNs are used in combination with VQCs, it is not completely clear which part of the learning is actually conducted by the quantum part. The differing performance of the two architecture choices highlights the importance of designing a sophisticated data encoding scheme.

Table 18: Algorithmic Characteristics - Q. Lan [Lan21]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| Pendulum (OpenAI Gym) | Quantum-SAC | $Q$-function | continuous, $3$-dim | continuous | $3$ | $3$ to $12$ (encoding); $36$ (weights) |

¹ the hybrid model also incorporates additional classical parameters in an appended NN;

Quantum Multi-Agent Reinforcement Learning via Variational Quantum Circuit Design, Yun et al. (2022)


Summary. This paper by Yun et al. [Yun+22] introduces a quantum multi-agent reinforcement learning (QMARL) approach. It is applied to an environment inspired by wireless communication. The authors achieve results that are competitive with classical NNs with higher parameter complexity.

QMARL Framework and VQC Architecture. The approach is inspired by the classical method of centralized training with decentralized execution (CTDE). This approach deals with the problems introduced by a non-stationary reward structure, caused by the interaction of multiple agents [Low+17].

The actor-critic structure employs only a single critic (i.e. represented by a single VQC), which receives the rewards. A naive implementation would increase the qubit count with the number of agents. To resolve this problem, the state encoding routine is modified such that only one qubit is required for each agent.

The general VQC architecture follows the typical three-part structure. The states are encoded using a feature map with $1$-qubit rotations. The state space of the environment is four-dimensional. Consequently, four qubits are used to represent the actor associated with each of the four agents. For the critic, all rotations for the state of one agent are applied to a single qubit. This implies a qubit count equal to the number of agents (i.e. implemented for $4$ qubits in the article). The following learnable layer(s) consist of $1$-qubit rotations and some unspecified entanglement structure. The choice of the measured observables $M$ is not explicitly stated.

Experimental Results and Discussion. The QMARL algorithm is applied to a communication task referred to as Single-Hop Offloading environment. It simulates two clouds, between which packages have to be distributed along four edges. Each cloud and edge has a queue with a certain capacity. One agent is used to learn the actions of its associated edge. The objective is to minimize the overflow and underflow of queues.

The paper compares four different multi-agent reinforcement learning (MARL) and QMARL frameworks: (1) the described version, where actor and critic are represented with a VQC; (2) a modified pipeline, where the critic is represented with a classical NN; (3) a small-scale classical MARL approach; and (4) a large-scale classical MARL algorithm with over $40000$ trainable parameters. Setups (1) to (3) contain $50$ trainable parameters each.

The results demonstrate that the QMARL approach (1) is competitive with the large-scale MARL algorithm (4). In contrast, the hybrid QMARL method (2) and the small-scale classical MARL approach (3) seem to lack the expressivity to solve this task. The authors conclude that QMARL yields some quantum advantage, as the parameter complexity is drastically reduced.

Remarks. Potentially, compressing all observations of one agent into one qubit is not sufficient to represent the information in a lossless manner. Therefore, larger-scale experiments should be conducted to get more insights into the proposed quantum multi-agent architecture. The same holds for the reduced parameter complexity compared to classical models.

Table 19: Algorithmic Characteristics - Yun et al. [Yun+22]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates |
| --- | --- | --- | --- | --- | --- | --- |
| Single-Hop Offloading (see [Yun+22]) | Multi-Agent Actor-Critic QMARL | $Q$-function, Policy | continuous, $4$-dim | discrete, $4$ | $4$ | $4$ or $16$ (encoding)¹; N/A (weights) |

¹ the $4$ quantum actors use $4$ encoding parameters each; the quantum centralized critic contains $16$;

Quantum Multi-Agent Meta Reinforcement Learning, Yun et al. (2023)


Summary. The second paper by Yun et al. [YPK23] extends their previous approach [Yun+22] with various new techniques for QMARL. It proposes to use meta-learning by pre-training only one individual agent. This is followed by fine-tuning in the multi-agent scenario. To this end, two different types of trainable parameters are used, i.e. trainable measurements are introduced to complement the typical variational parameters. The approach is also extended to continual learning, where meta-learning is performed on multiple environments at once.

VQC Architecture and meta-QMARL. The underlying RL algorithm employs an SAC approach with the VQC as function approximator for the action-value function. The quantum circuit uses the three-layer structure of $1$-qubit rotation data encoding, variational layers with entanglement gates, and measurement. The paper applies QRL to multi-agent problems and extends the original proposal on quantum CTDE by Yun et al. [Yun+22]. An additional step is introduced for the training procedure, resulting in a meta-learning approach.

In order to realize these concepts, the authors define two different sets of parameters. First, there are the typical variational parameters $\boldsymbol{\phi}$, usually parameterizing $1$-qubit rotations. Second, it is also possible to parameterize and train the measurement observables. The paper proposes to use $M_{\theta_{1,2}^{(m)}}=R_{x}^{\dagger}(\theta_{1})\cdot R_{y}^{\dagger}(\theta_{2})\cdot Z\cdot R_{y}(\theta_{2})\cdot R_{x}(\theta_{1})$ as observable on the $m$-th qubit, i.e. two trainable parameters for each $1$-qubit observable. Basically, this trainable observable introduces a change of basis, as final measurements are always performed in the computational basis. The instantiated observable can be visualized on the Bloch sphere as the angle w.r.t. which the measurement is performed.
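The change-of-basis interpretation can be checked numerically with a few lines of plain Python. The sketch below builds the observable from the formula above for a single qubit and exposes an expectation-value helper, so that measuring the trainable observable on a state can be compared with measuring $Z$ on the rotated state. The helper names and the gate convention $R(\theta)=e^{-i\theta P/2}$ are assumptions made for illustration.

```python
import math

Z = [[1, 0], [0, -1]]


def rx(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -1j * s], [-1j * s, c]]


def ry(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(2)) for i in range(2)]


def trainable_observable(theta1, theta2):
    """M = Rx^dagger(theta1) Ry^dagger(theta2) Z Ry(theta2) Rx(theta1)."""
    basis_change = matmul(ry(theta2), rx(theta1))
    return matmul(dagger(basis_change), matmul(Z, basis_change))


def expval(obs, psi):
    """Real expectation value <psi|obs|psi> of a Hermitian 2x2 observable."""
    out = matvec(obs, psi)
    return sum(a.conjugate() * b for a, b in zip(psi, out)).real
```

For any normalized state `psi`, `expval(trainable_observable(t1, t2), psi)` agrees with `expval(Z, matvec(matmul(ry(t2), rx(t1)), psi))`. This is why the trainable observable can be realized as an extra rotation layer followed by a standard computational-basis measurement.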

Both parameter sets are trained in alternating steps. The first step is referred to as meta quantum neural network (QNN) angle training and focuses exclusively on the variational parameters $\boldsymbol{\phi}$. This step trains only a single quantum agent, which interacts with several other agents in the multi-agent environment. Unfortunately, the authors do not state how this interaction is actually realized. We assume that the quantum agent interacts with other classical agents in this initial training phase. During this phase, the pole parameters $\boldsymbol{\theta}$ are not updated, but they can be perturbed by randomly selected values to form a kind of angle-to-pole regularization. The second phase, the local QNN pole training, focuses on the parameterized observables. Those are fine-tuned individually for each copy of the meta-trained QNN, corresponding to the all-quantum agents interacting in the multi-agent environment. The authors argue that meta-training the network makes fine-tuning the individual agents more efficient. This is justified with the lower parameter complexity, as the variational parameters remain constant in the second training phase. The loss function is the sum of all $Q$-learning losses of the individual agents.

Additionally, the paper introduces the concept of pole memory, which refers to storing the trained pole parameters for the individual agents. As these sets are much smaller than the set of variational parameters, it is more efficient to store the full configuration.

Experimental Results. The introduced training routine is executed on a two-step two-agent environment. It is observed that the meta-training converges more slowly than direct training of a QMARL agent. However, once this training has converged, fine-tuning is much more efficient. Overall, the authors conclude that the additional step of meta-training enhances convergence in a multi-agent environment.

Extension to Continual Learning. The above setting is also extended to continual learning, i.e. training in more than one environment (or typically the same environment with slightly altered dynamics).

The investigation focuses on the difference in performance with and without the use of pole memory. The results suggest that resetting the poles to the initial state (i.e. the parameter setting with which meta training was conducted) benefits convergence speed and stability in an environment with alternating dynamics. Meta training with a higher degree of angle-to-pole regularization seems to enhance the generalization performance of the meta-QNN.

Remarks. The paper does not state explicitly how exactly the initial meta training is conducted. Considering the results, the VQCs seem to have some capability w.r.t. transfer learning, as which the meta-learning and continual training can be interpreted. The idea of employing trainable observables also has potential for other approaches, as it partially avoids the necessity to explicitly pre-select an action decoding scheme. Practically, these trainable observables are introduced by appending a layer to the VQC which learns a specific measurement. A significant difference to pre-existing procedures is that these parameters are not trained simultaneously with the typical variational parameters. It is not completely clear whether this two-step training procedure is beneficial in a general setup.

Table 20: Algorithmic Characteristics - Yun et al. [YPK23]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| Single-Hop Offloading (see [Yun+22]) | Meta-Multi-Agent SAC “Meta-QMARL” | $Q$-function | continuous, $4$-dim | discrete, $4$ | $4$ | $4$ (encoding); N/A (weights) |
| Two-Step Game (see [YPK23]) | Meta-Multi-Agent SAC “Meta-QMARL” | $Q$-function | continuous, $4$-dim | discrete, $2$ | $2$ | $2$ (encoding); N/A (weights) |

¹ the parameter counts are denoted for a single agent;

4.2.4 Offline Methods

Offline reinforcement learning [Lev+20] deals with the setting in which no direct interaction with the environment is possible. Instead, the agent is trained on a set of pre-acquired data. Two alternative formulations for the quantum realm have been proposed in Periyasamy et al. [Per+23] and Cheng et al. [Che+23].

| Citation | First Author | Title |
| --- | --- | --- |
| [Per+23] | M. Periyasamy | Batch Quantum Reinforcement Learning |
| [Che+23] | Z. Cheng | Offline Quantum Reinforcement Learning in a Conservative Manner |

Table 21: Work considered for “QRL with VQCs – Offline Methods” (Sec. 4.2.4)

Batch Quantum Reinforcement Learning, Periyasamy et al. (2023)


Summary. In this work, Periyasamy et al. [Per+23] propose batch-constrained quantum $Q$-learning (BCQQ), an offline QRL algorithm based on the classical discrete batch-constrained deep $Q$-learning (BCQ) algorithm by Fujimoto et al. [Fuj+19]. Furthermore, the authors introduce a novel data re-uploading (DRU) scheme, which they call cyclic DRU. Experiments are executed in the OpenAI CartPole environment.

Algorithm. The key idea in BCQ is that, in order to avoid a distributional shift from training to testing, a trained policy should induce at test time a state-action visitation similar to that observed in the offline training data, the so-called batch. Hence the name batch-constrained.

To achieve this, BCQ trains a generative model $G_{\omega}$ to pre-select likely actions based on the batch. Through this selection, the policy is constrained to only choose from a subset of actions. In the case of a discrete action space, the generative model can be understood as a map $G_{\omega}:\mathcal{S}\rightarrow\Delta\left(\mathcal{A}\right)$ that takes the current environment state as input and outputs the probability with which each action would occur in the batch. In particular, if the batch is filled using transitions from a policy $\pi_{b}$, then the generative model should imitate this policy, i.e. $G_{\omega}(a|s)\approx\pi_{b}(a|s)$. Therefore, $G_{\omega}$ is called imitator.

Using this imitator, actions can be pre-selected by discarding actions whose probability relative to the most likely one is below a threshold $\tau$:

$$\tilde{\mathcal{A}}(s)=\left\{a\in\mathcal{A}\,\middle|\,\frac{G_{\omega}(a|s)}{\max_{\hat{a}\in\mathcal{A}}G_{\omega}(\hat{a}|s)}>\tau\right\}. \tag{46}$$

The actions selected by the imitator are then evaluated by a $Q$-network, which is trained by only considering the selected actions in the loss computation. The imitator itself is trained with a standard cross-entropy loss

$$l(\omega)=-\sum_{(s,a)\in\mathcal{B}}\log\left(G_{\omega}(a|s)\right).$$

Additionally, to address the overestimation bias of $Q$-learning towards state transitions that are underrepresented in the batch, double DQN [VGS16] is employed.
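The three ingredients introduced so far (the action filter of Eq. (46), the imitator loss, and a double-DQN-style target) can be sketched together in plain Python. This is a simplified illustration and not the authors' implementation: the imitator and the two $Q$-functions are passed in as plain callables returning one value per action, the default discount factor is a placeholder, and all function names are hypothetical.

```python
import math


def allowed_actions(imitator_probs, tau):
    """Action indices kept by the filter of Eq. (46); assumes 0 <= tau < 1."""
    best = max(imitator_probs)
    return [a for a, p in enumerate(imitator_probs) if p / best > tau]


def imitator_loss(batch, imitator):
    """Cross-entropy loss: negative log-likelihood of the batch actions.

    batch is an iterable of (state, action) pairs; imitator(state)
    returns a list of action probabilities.
    """
    return -sum(math.log(imitator(state)[action]) for state, action in batch)


def constrained_target(reward, next_state, done,
                       q_online, q_target, imitator, tau, gamma=0.99):
    """Double-DQN-style target restricted to the pre-selected actions.

    The online network chooses the best allowed action, the target
    network evaluates it. gamma=0.99 is an assumed placeholder value.
    """
    if done:
        return reward
    candidates = allowed_actions(imitator(next_state), tau)
    online_q = q_online(next_state)
    best = max(candidates, key=lambda a: online_q[a])
    return reward + gamma * q_target(next_state)[best]
```

Because the most likely action always has a relative probability of one, the candidate set is never empty for $\tau<1$. For $\tau=0$ the filter keeps every action with non-zero imitator probability, so the constraint becomes ineffective, while values of $\tau$ close to one push the agent towards pure imitation of the batch.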

Finally, the BCQQ algorithm is obtained by applying the variational quantum deep $Q$-networks (VQ-DQN) proposed by Franz et al. [Fra+22] as function approximators for both the imitator and $Q$-network. Moreover, for the model training the authors use the AMSgrad optimizer in combination with gradients approximated via SPSA. Wiedmann et al. [Wie+23] demonstrated that SPSA can be used to efficiently train medium-sized VQCs with a reduced number of circuit runs, compared to the commonly used parameter-shift rule.
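To illustrate why SPSA reduces the number of circuit runs, the following minimal sketch shows the textbook SPSA gradient estimator. It needs two loss evaluations per step regardless of the number of parameters, whereas the standard two-term parameter-shift rule needs two evaluations per parameter. The function name, the perturbation size `c`, and the use of a generic `loss_fn` are assumptions; the AMSGrad update that consumes the estimate in the paper is not shown.

```python
import random


def spsa_gradient(loss_fn, params, c=0.1, rng=random):
    """One-sample SPSA estimate of the gradient of loss_fn at params.

    All parameters are perturbed simultaneously along a random
    +/-1 direction, so only two loss evaluations are required.
    """
    delta = [rng.choice((-1.0, 1.0)) for _ in params]
    plus = [p + c * d for p, d in zip(params, delta)]
    minus = [p - c * d for p, d in zip(params, delta)]
    scale = (loss_fn(plus) - loss_fn(minus)) / (2.0 * c)
    # Dividing by delta_i equals multiplying by it, since delta_i is +/-1.
    return [scale * d for d in delta]
```

A single estimate is noisy for more than one parameter, but it is unbiased up to higher-order terms in `c`, so averaging over steps (or feeding it into a momentum-based optimizer) recovers a useful descent direction.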

Model Architecture. The VQC used as the function approximator for the imitator and $Q$-network is shown in Fig. 12. Each entry of the four-dimensional state vector returned by the CartPole environment is encoded using a single-qubit $R_{x}$ gate on an individual qubit. The variational block comprises five layers containing four parameterized $R_{y}$ and four parameterized $R_{z}$ gates each. In addition to the parameterized rotational gates, each layer also includes two-qubit CZ entanglement gates with nearest-neighbor connectivity. The CartPole environment has two discrete actions. Therefore, the expectation value of the Pauli-$ZZ$ observable on qubits 1 and 2 and of the Pauli-$ZZ$ observable on qubits 3 and 4 are used to decode the $Q$-values from the VQC. Furthermore, trainable classical weights are applied to both expectation values to increase the range of possible $Q$-values.

Periyasamy et al. [Per+22] established that spreading encoding gates for the feature vector of a given data point throughout the quantum circuit results in an improved representation of the data when the expectation values are measured for observables containing all Pauli strings. Following this, the authors use a re-uploading scheme, which exposes each qubit to all the entries of the current input state vector. Contrary to the standard data re-uploading, where the encoding scheme is re-introduced after each variational layer as such, the encoding scheme is re-introduced with the input state vector shifted by one step in a round-robin fashion. The structure of a VQC with this cyclic DRU is shown on the right of Fig. 12.
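The round-robin shift can be made concrete with a small sketch that lists which input feature each qubit receives in each re-uploading layer. The shift direction and the function name are assumptions made for illustration; only the shift by one position per layer is taken from the description above.

```python
def cyclic_encoding_order(obs, n_layers):
    """Feature seen by each qubit in each re-uploading layer.

    Entry [layer][q] is the feature encoded on qubit q in that layer:
    the input vector is shifted by one position per layer (assumed
    direction), so after len(obs) layers every qubit saw every feature.
    """
    n = len(obs)
    return [[obs[(q + layer) % n] for q in range(n)]
            for layer in range(n_layers)]
```

For a four-dimensional CartPole observation and five layers, the first four rows form a cyclic Latin square and the fifth row repeats the first. Standard data re-uploading would instead correspond to repeating the first row in every layer.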

Figure 12: Left: VQC that is used as function approximator in the discrete BCQQ algorithm. Right: VQC with cyclic DRU. Note: Each $\vec{\theta}$ block represents the repetition of the variational layer ansatz with different trainable parameters. Both taken from [Per+23].

Experimental Results and Discussion. In order to evaluate the performance of BCQQ, the authors train policies on buffers with varying sizes, filled with randomly sampled environment interactions. As a classical benchmark the authors train neural networks instead of VQCs on the same buffers. For this benchmark, they first use a fully connected neural network with a total number of 67270 parameters and second a smaller network with just 55 parameters. The number of parameters in the smaller network is much more comparable to the VQC. The authors find that the BCQQ agent is able to learn an optimal policy, achieving the maximum reward of 500, from a buffer of just 100 random environment interactions. Interestingly, the classical agents fail to learn a policy in this low data regime, suggesting a potential quantum advantage in terms of the sample efficiency.

Moreover, the cumulative reward these models can achieve beyond 500 is tested, which shows that the VQC with cyclic DRU outperforms the VQC with standard DRU. All these experiments were performed using an early stopping criterion, where during training the current policy is evaluated in the actual environment to save computational resources. Strictly speaking, this makes the training not fully offline. In a second experiment, however, the authors train the VQC with cyclic DRU on a buffer filled with 100 interactions obtained from an optimal policy with noise. From this, the authors show that the BCQQ agent can learn an optimal policy from this noisy buffer without early stopping.

Remarks. It remains to be shown that the observed sample efficiency scales to more complex environments. Furthermore, a more elaborate analysis of the effectiveness of cyclic DRU could give insights for future VQC design.

Table 22: Algorithmic Characteristics - Periyasamy et al. [Per+23]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| CartPole (OpenAI Gym) | BCQ | Imitator, $Q$-function | continuous, 4-dim | discrete, 2 | 4 | $4 \times 1$ (encoding); $4 \times 15 \times 2$ (weights); N/A (classical)² |

¹ encoding gates: $qubits \times per\_qubit$; variational gates: $qubits \times layers \times per\_qubit\_per\_layer$.
² model incorporates classical weights after measurement, details are not stated.

Offline Quantum Reinforcement Learning in a Conservative Manner, Cheng et al. (2023)


Summary. This work by Cheng et al. [Che+23] introduces the offline QRL algorithm conservative quantum $Q$-learning (CQ2L). In contrast to online RL, offline RL is used in scenarios where the agent cannot interact with the environment during training and is hence trained purely data-driven from a set of previously collected data. The proposed algorithm is based on the classical conservative $Q$-learning (CQL) algorithm by Kumar et al. [Kum+20]. Experiments are conducted in the OpenAI CartPole, Acrobot and MountainCar environments.

Algorithm. The objective of offline RL is to learn a near-optimal policy from a fixed dataset $\mathcal{D}$ sampled with a behavior policy $\pi_b$, without further environment interactions. A major challenge in this setting is that the fundamental assumption that agents can sample data online is violated. This means that agents have to learn a policy or value function from out-of-distribution (OOD) data, which is nontrivial. This distributional shift makes it hard to evaluate and consequently improve current $Q$-value functions, leading to an extrapolation error [Kos+21].

Under the online setting, agents obtain corrective feedback through environment interactions. However, for offline training, the extrapolation error means that agents could overestimate $Q$-values for unseen state-action pairs, which could lead to poor performance. Hence, CQL suppresses the overestimation problem in offline RL by learning a conservative $Q$-value function. In particular, this is achieved via double $Q$-learning [VGS16] and a penalty term to update the $Q$-values in a conservative manner. The resulting conservative update target is obtained as

$$\underset{Q}{\operatorname{argmin}}\;\alpha\cdot\underset{s\sim\mathcal{D}}{\mathbb{E}}\left(\log\sum_{a\in A}\exp\left(Q(s,a;\theta_k^A)\right)-\underset{a\sim\pi_b}{\mathbb{E}}\,Q(s,a;\theta_k^A)\right)+\underset{(s,a,r,s')\sim\mathcal{D}}{\mathbb{E}}\left(Y_k^{\text{DoubleQ}}-Q(s,a)\right)^2, \tag{47}$$

with the double $Q$-learning target update

$$Y_k^{\text{DoubleQ}} := r+\gamma\cdot Q\left(s',\underset{\overline{a}\in A}{\operatorname{argmax}}\;Q(s',\overline{a};\theta_k^A);\theta_k^B\right). \tag{48}$$

Here, $\theta_k^A$ and $\theta_k^B$ denote two independent sets of parameters, which are updated similarly to the target network in the deep $Q$-network (DQN) algorithm, by symmetrically exchanging the roles of $\theta_k^A$ and $\theta_k^B$ in Eq. 48. Having these independent parameters helps to compute unbiased $Q$-value estimates. CQ2L is then obtained by implementing the $Q$-value function via the variational VQ-DQN proposed by Franz et al. [Fra+22].
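As a rough illustration of Eqs. 47 and 48, the sketch below evaluates the double $Q$-learning target and the conservative loss on a batch of precomputed $Q$-values. It is not the implementation of [Che+23]: the function names are hypothetical, the termination mask `done` is an added assumption that does not appear in Eq. 48, and the expectation over $a\sim\pi_b$ is approximated by the actions logged in the dataset.

```python
import numpy as np

def double_q_target(r, q_next_a, q_next_b, done, gamma=0.99):
    """Eq. 48: select the action with parameters A, evaluate it with B."""
    best = np.argmax(q_next_a, axis=1)
    q_eval = q_next_b[np.arange(len(r)), best]
    # Assumption: no bootstrapping from terminal transitions (done = 1).
    return r + gamma * (1.0 - done) * q_eval

def cql_loss(q_a, actions, targets, alpha=1.0):
    """Eq. 47 on a batch: conservative penalty plus squared Bellman error."""
    q_taken = q_a[np.arange(len(actions)), actions]
    # Numerically stable log-sum-exp over all actions.
    m = q_a.max(axis=1)
    lse = m + np.log(np.exp(q_a - m[:, None]).sum(axis=1))
    penalty = np.mean(lse - q_taken)          # pushes down unseen actions
    bellman = np.mean((targets - q_taken) ** 2)
    return alpha * penalty + bellman

# Toy batch with two transitions and two actions.
r = np.array([1.0, 1.0])
done = np.array([0.0, 1.0])
q_next_a = np.array([[0.0, 5.0], [2.0, 1.0]])
q_next_b = np.array([[3.0, 4.0], [7.0, 8.0]])
y = double_q_target(r, q_next_a, q_next_b, done, gamma=0.5)
loss = cql_loss(np.array([[2.5, 0.0], [1.0, 1.0]]), np.array([0, 1]), y)
```

The penalty term is small when the logged action already dominates the log-sum-exp, and grows when actions outside the dataset receive comparatively large $Q$-values.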

Model Architecture. VQCs with 5 layers are used to represent the $Q$-value functions. For CartPole, Acrobot and MountainCar, systems of 4, 6 and 2 qubits are used, respectively. According to the feasible actions in these environments, the quantum observables $[Z_0 Z_1, Z_2 Z_3]$, $[Z_0, Z_1, Z_2]$, and $[Z_0, Z_0 Z_1, Z_1]$ are chosen, where $Z_i$ denotes the Pauli-$Z$ observable measured on the $i$-th qubit. Input data are encoded with $X$ rotation gates, while the variational part includes $X$, $Y$, and $Z$ rotation gates. Moreover, qubits are entangled in a circular topology. The variational part, entanglement, and data encoding are repeated several times, after which the Pauli-$Z$ observables are measured to determine the $Q$-values.

Experimental Results and Discussion. To evaluate the offline QRL algorithm, the authors create offline data sampled by a DQN agent with an epsilon-greedy policy, interacting with the corresponding environment. The sampled data are recorded in a replay buffer with length $1\times 10^{6}$ and then saved for offline QRL. The logged data contain tuples of $(s_t, a_t, r_t, s_{t+1}, d)$, where $d$ indicates whether an episode terminates. For training, a single trajectory from the collected buffer is selected.

The authors compare the performance of CQ2L with the off-policy VQ-DQN trained offline on the same data. These experiments show that CQ2L is able to solve all given environments and outperform offline VQ-DQN. The latter indicates that it is not feasible to directly extend off-policy QRL algorithms like VQ-DQN to the offline setting. Furthermore, the authors find that CQ2L performs only marginally worse than online VQ-DQN in CartPole. Interestingly, online VQ-DQN fails to solve Acrobot and MountainCar and is clearly outperformed by CQ2L.

Finally, the performance is compared to classical CQL, where a fully connected neural network with a similar number of parameters as the VQC is used. The results indicate that CQ2L could achieve comparable performance to the classical one. Besides, no significant advantages in the sample efficiency or the parameter size are observed. The authors hypothesize that this may indicate that the current structure of VQCs or the limited number of qubits is not sufficient to exhibit quantum advantages for QRL.

Remarks. The finding that CQ2L offers no significant advantage over classical CQL in sample efficiency or parameter count contradicts other observations in the literature, where at least for small system sizes some improvement w.r.t. parameter complexity was observed. However, we agree with the authors' statement that such performance improvements might strongly depend on the specific VQC architecture.

Table 23: Algorithmic Characteristics - Cheng et al. [Che+23]

| Environment | Algorithm Type | Quantum Component | State Space | Action Space | Qubits | Parameterized Gates¹ |
| --- | --- | --- | --- | --- | --- | --- |
| CartPole (OpenAI Gym) | CQL | $Q$-function | continuous, 4-dim | discrete, 2 | 4 | $4 \times 1$ (encoding); $4 \times 15 \times 2$ (weights); N/A (classical)² |
| Acrobot (OpenAI Gym) | CQL | $Q$-function | continuous, 6-dim | discrete, 3 | 6 | $4 \times 1$ (encoding); $4 \times 15 \times 2$ (weights); N/A (classical)² |
| MountainCar (OpenAI Gym) | CQL | $Q$-function | continuous, 2-dim | discrete, 3 | 2 | $4 \times 1$ (encoding); $4 \times 15 \times 2$ (weights); N/A (classical)² |

¹ encoding gates: $qubits \times per\_qubit$; variational gates: $qubits \times layers \times per\_qubit\_per\_layer$.
² model incorporates classical weights after measurement, details are not stated.

4.2.5 Algorithmic and Conceptual Extensions

This section describes extensions to the VQC-based QRL framework that are relevant to several of the previously classified methods. This entails tools to deal with partially observable (quantum) environments discussed in Kimura et al. [Kim+21]. A big emphasis is put on the explicit design of model architectures. Work by Hsiao et al. [Hsi+22, Tru+23] demonstrates that this is indeed an important topic, as otherwise everything could be easily emulated with classical architectures. Different approaches to this design task are discussed in Refs. [Che23c, Che23a, Dră+22, Kru+23, SMT23, ACN23, PPR20]. Avoiding the typical gradient-based training routines, an evolutionary approach is proposed by Chen et al. [Che+22] and also discussed in Refs. [DS23, Köl+23].

| Citation | First Author | Title |
| --- | --- | --- |
| [Kim+21] | T. Kimura | Variational Quantum Circuit-Based Reinforcement Learning for POMDP and Experimental Implementation |
| [Hsi+22] | J.-Y. Hsiao | Unentangled quantum reinforcement learning agents in the OpenAI Gym |
| [Tru+23] | N. Truong | Investigating Quantum Reinforcement Learning structure to the CartPole control task |
| [Che23c] | S. Y.-C. Chen | Quantum deep recurrent reinforcement learning |
| [Che23a] | S. Y.-C. Chen | Efficient quantum recurrent reinforcement learning via quantum reservoir computing |
| [Dră+22] | T.-A. Drăgan | Quantum Reinforcement Learning for Solving a Stochastic Frozen Lake Environment and the Impact of Quantum Architecture Choices |
| [Kru+23] | G. Kruse | Variational Quantum Circuit Design for Quantum Reinforcement Learning on Continuous Environments |
| [SMT23] | Y. Sun | Differentiable Quantum Architecture Search for Quantum Reinforcement Learning |
| [ACN23] | E. Andrés | Efficient Dimensionality Reduction Strategies for Quantum Reinforcement Learning |
| [Che+22] | S. Y.-C. Chen | Variational quantum reinforcement learning via evolutionary optimization |
| [DS23] | L. Ding | Multi-objective evolutionary search for parameterized quantum circuits |
| [Köl+23] | M. Kölle | Multi-Agent Quantum Reinforcement Learning using Evolutionary Optimization |

Table 24: Work considered for “QRL with VQCs – Algorithmic and Conceptual Extensions” (Sec. 4.2.5)
Variational Quantum Circuit-Based Reinforcement Learning for POMDP and Experimental Implementation, Kimura et al. (2021)


Summary. The paper by Kimura et al. [Kim+21] extends the concept of VQC-based RL to partially observable environments. The approach is inspired by classical model-free, complex-valued RL [HSS06]. Additionally, a novel VQC architecture (novel with regard to measurement procedure) is proposed. A detailed description of the gradient computation with backpropagation techniques is provided (it is not quite clear how this method generalizes to quantum hardware).

Partially Observable MDP. A partially observable Markov decision process (POMDP) is described as a tuple $(S,A,T,R,\Omega,O)$ and is a generalization of an MDP. The variable $S$ denotes a discrete state space, $A$ is a discrete set of actions, $T(s'|s,a)$ describes the state transition probabilities and $R(s,a)$ is a reward function. Extending the fully-observable case, $\Omega$ is a discrete set of observations and $O(o|s,a)$ is an observation probability matrix with $o\in\Omega$.

One caveat of partially observable environments is the perceptual aliasing problem: the agent cannot distinguish two different states due to its limited observation ability. An example of such an environment is the partially observable maze used in Kimura et al. [Kim+21]. Similar to most gridworld environments, the task is to navigate from the start state to the goal state on the shortest path possible. However, the observations provided to the agent are ambiguous, as several cells return the same state indicator.

Solving POMDPs with Complex-Valued RL. One way to bypass this state ambiguity is to introduce a belief distribution over possible states. Unfortunately, this is computationally expensive. An alternative approach is complex-valued RL [HSS06, MNM17]. It incorporates time series information into the action-value function, which is represented by complex numbers. More concretely, the complex $\dot{Q}$-function ($\dot{x}$ denotes complex values) encodes the history of the agent, i.e. the previously visited states. The cumulative reward value is expressed by the absolute value of the $\dot{Q}$-function, while the path length of the propagated reward is represented by the phase of the $\dot{Q}$-function in the complex plane. Therefore, $\dot{Q}$-learning keeps continuity w.r.t. the described internal reference value. This helps distinguish states which are affected by the perceptual aliasing problem. Formally, this is achieved by updating the complex values in the opposite phase direction. The complex-valued $\dot{Q}$-function can be represented with tabular methods [HSS06], or with complex-valued NNs [MNM17].

The update mechanism represents a generalized $Q$-learning approach, i.e. the objective is to optimize the loss function $L_{\theta}=\frac{1}{2}\cdot\left|\dot{Q}(o_{t-k},a_{t-k})-(r_{t+1}+\gamma\cdot\dot{Q}_{max}(t))\,\dot{u_t}(k)\right|$. Here, $\dot{u_t}(k)=\dot{\beta}^{k+1}$ is a complex hyperparameter and $k$ is the trace length. In the following, the NN is replaced with a VQC as action-value function approximator.
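The loss $L_{\theta}$ can be evaluated directly with Python's built-in complex numbers. The sketch below follows the formula as stated here; the function name, the value of $\dot{\beta}$, and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def complex_q_loss(q_past, reward, q_max, beta, k, gamma=0.9):
    """L = 1/2 * |Q(o_{t-k}, a_{t-k}) - (r_{t+1} + gamma * Q_max(t)) * beta**(k+1)|."""
    u = beta ** (k + 1)                      # phase factor for trace length k
    target = (reward + gamma * q_max) * u
    return 0.5 * abs(q_past - target)

# Hypothetical unit-modulus hyperparameter: it only rotates the phase.
beta = np.exp(-1j * np.pi / 6)
q_max = 0.8 * np.exp(1j * 0.3)
target = (1.0 + 0.9 * q_max) * beta ** 3
print(complex_q_loss(target, 1.0, q_max, beta, k=2))   # zero loss at the target
print(complex_q_loss(0.0, 1.0, 0.0, beta, k=2))
```

With a unit-modulus $\dot{\beta}$, increasing the trace length $k$ rotates the target further in the complex plane without changing its magnitude, which is how the phase can carry information about the path length.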

VQC Architecture and Gradient Computation. The paper deviates in several design choices from the standard method. Most importantly, the $\dot{Q}$-values for the different actions are not extracted from the same circuit (e.g. measurement on different qubits corresponding to different actions). Instead, the actions are encoded into the VQC with a feature map similar to the one used for state encoding. Consequently, different circuits have to be evaluated for each action. This encoding can happen either directly in the feature map, or alternatively in the decoding unitary. However, the three-part structure of the circuit is preserved.

Figure 13: Encoding and decoding architectures proposed by and taken from [Kim+21].

The encoding unitary $U_{encoder}$ consists of simple parameterized 1-qubit rotations, where three different concrete encodings are considered as shown in Fig. 13. The Type 1 feature map encodes the observations directly with an $\arcsin$ function. Type 2 uses a computational encoding for the observations, basically equivalent to the one proposed by Chen et al. [Che+20]. Type 3 also uses an $\arcsin$ transform, but directly encodes the action information into the feature map.

The variational part repeats several layers of parameterized 1-qubit rotations, followed by a circular entanglement structure. In the experimental part, the authors consider different circuit depths.

The output of the circuit is evaluated with the Hadamard test, which measures the prepared state against an output unitary $U_{out}$. This introduces an overhead since a controlled version of the unitary $U_{out}$ needs to be implemented (details in Fig. 13). Additionally, an ancilla qubit and three 1-qubit gates are required. The output unitary itself also consists of learnable 1-qubit rotations. If encoding unitaries of Type 1 or Type 2 are used, it additionally encodes the action information. The real and imaginary outputs of the Hadamard test are used to construct the complex-valued $\dot{Q}$-function.
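A generic Hadamard test can be sketched with exact statevector arithmetic as follows. This is the textbook construction and not necessarily the exact circuit of [Kim+21]: the function name is hypothetical, the random state and unitary merely stand in for the prepared state and $U_{out}$, and on hardware the ancilla probabilities would be estimated from a finite number of shots rather than computed exactly.

```python
import numpy as np

def hadamard_test(psi, u, imaginary=False):
    """Return P(0) - P(1) of the ancilla, i.e. Re or Im of <psi|U|psi>."""
    # After H on the ancilla and controlled-U, the joint state is
    # (|0>|psi> + |1> U|psi>) / sqrt(2); keep the two branches separately.
    branch0 = psi / np.sqrt(2)
    branch1 = (u @ psi) / np.sqrt(2)
    if imaginary:
        branch1 = -1j * branch1              # S-dagger on the ancilla
    # The final Hadamard on the ancilla mixes the two branches.
    out0 = (branch0 + branch1) / np.sqrt(2)
    out1 = (branch0 - branch1) / np.sqrt(2)
    return np.vdot(out0, out0).real - np.vdot(out1, out1).real

rng = np.random.default_rng(0)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)
# Random unitary from the QR decomposition of a random complex matrix.
u, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))

q_dot = hadamard_test(psi, u) + 1j * hadamard_test(psi, u, imaginary=True)
print(q_dot, np.vdot(psi, u @ psi))
```

The two runs together recover the full complex overlap, which is the quantity from which a complex-valued action value can be assembled.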

The evaluation of the circuits is straightforward on quantum hardware. However, this does not apply to evaluating gradients w.r.t. the parameters, which is necessary for training. The paper gives a detailed derivation of how to compute the gradients via simulation on classical hardware. The idea is inspired by classical backpropagation and resembles the adjoint method [Luo+20]. Because it relies on access to intermediate states of the simulation, it is infeasible, at least in the given form, on actual quantum hardware.

Experimental Results and Discussion. The paper compares the training results (on the described maze environment) for the three types of quantum agents to different classical agents. The classical tabular approach outperforms all other methods, as the underlying algorithm guarantees an optimal solution. The authors argue that there seems to be some intrinsic advantage of the Type 2 quantum circuits, as these perform better than the other approximate algorithms.

Remarks. We think further investigation is needed regarding the applicability of the algorithm to actual quantum hardware. Currently, we propose to consider the approach as QiRL. We agree that QC offers great potential for complex-valued RL, as QC itself deals with complex numbers. However, there are still open questions regarding the most promising way to exploit this connection. A quantum version of a POMDP is discussed in Ref. [BBA14], which might provide an interesting extension of this paper.

Unentangled quantum reinforcement learning agents in the OpenAI Gym, Hsiao et al. (2022)


Summary. The paper by Hsiao et al. [Hsi+22] uses a hybrid proximal policy optimization (PPO) algorithm, with a combination of VQC and NN as policy function approximator. The quantum circuit architecture is atypical, as it only uses 1-qubit rotations. Consequently, no entanglement is created, and all qubits can be considered as independent systems. Still, the resulting RL agent is able to learn good policies on some standard environments (CartPole, Acrobot, and LunarLander). The learned parameters are ported to quantum hardware and tested with satisfactory results.

Underlying RL Algorithm and Model Architecture. The classical RL algorithm is PPO, i.e. a policy-based approach. It follows the typical hybrid setup, as the VQC is used as function approximator, and parameter updates are computed on classical hardware. To enhance the expressivity of the model, a classical NN is appended. It uses the measured expectation values as inputs. The outputs of the network are post-processed using a softmax function.

The structure of the hybrid model is displayed in Fig. 14. The feature map consists of 1-qubit rotations, which is a common choice in the literature. The variational (‘parameter’ in Fig. 14) layer incorporates parameterized 1-qubit rotations. It is important to highlight that the circuit does not contain any multi-qubit gates. Consequently, no entanglement between the qubits is created. As efficient classical simulation of the circuit is possible, the approach should be counted towards QiRL. Despite this, the authors demonstrate that a good RL training performance can be achieved with this model.
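The classical simulability of such a circuit can be made concrete with a short sketch: without multi-qubit gates, the state stays a product state, so each qubit can be simulated on its own with a 2-dimensional vector and the cost grows only linearly with the number of qubits. The specific gates used below ($R_x$ for encoding, $R_y$ for the trainable rotation) and the function name are assumptions, since the text only specifies 1-qubit rotations.

```python
import numpy as np

Z = np.diag([1.0, -1.0]).astype(complex)

def rx(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def z_expectations(x, theta):
    """<Z_q> for every qubit of the product circuit Ry(theta_q) Rx(x_q) |0>."""
    out = []
    for xq, tq in zip(x, theta):
        # Each qubit is an independent 2-dimensional system.
        psi = ry(tq) @ rx(xq) @ np.array([1.0, 0.0], dtype=complex)
        out.append(np.vdot(psi, Z @ psi).real)
    return np.array(out)

x = np.array([0.3, -1.1, 2.0, 0.7])
theta = np.array([0.5, 0.2, -0.9, 1.4])
print(z_expectations(x, theta))
print(np.cos(x) * np.cos(theta))   # closed form for this gate choice
```

For this gate choice the expectation values even have the closed form $\cos(x_q)\cos(\theta_q)$, so the quantum part reduces to a simple trigonometric feature map that feeds the classical NN.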

Figure 14: Hybrid quantum-classical model proposed by and taken from Hsiao et al. [Hsi+22].

Experimental Results. The described hybrid agent is trained on three tasks from the OpenAI Gym, i.e. CartPole, Acrobot, and LunarLander. The quantum agent outperforms several classical architectures. As this is achieved with far fewer parameters, the authors claim that the approach points towards a potential advantage.

The results on LunarLander are remarkable in that it might be the most complex environment solved with VQC-based RL thus far. While the classical simulability prohibits any intrinsic quantum advantage, the models are still able to achieve a good performance. This raises the question of whether one can draw inspiration from quantum mechanics for purely classical approaches.

Testing on Quantum Hardware. Once the models are trained, they are tested with the learned parameters on IBMQ hardware (with up to 8 qubits, depending on the environment). The models are able to replicate the learned near-optimal behavior.

Remarks. Since the VQCs can be simulated classically without entanglement, we agree with the authors that the proposed algorithm should be considered as a QiRL approach. As the proposed model also incorporates a classical network, it is not clear which part of the learning is conducted by the VQC. The simple circuit structure might also explain why the results for testing on hardware are stable. Usually, a big portion of the noise is caused by two-qubit gates, which are not present in the used VQC. A partial re-implementation of this work can also be found in Ref. [Tru+23].

Compendium of Architecture Discussions


As demonstrated by the previously discussed work of Hsiao et al. [Hsi+22], it is important to put careful consideration into the design of the employed quantum model architecture. In the following, we briefly summarize several works that make contributions in that direction. The idea of incorporating the information of multiple timesteps via recurrent networks is discussed by S. Y.-C. Chen [Che23c] and extended in Ref. [Che23a]. Several explicit VQC architectures are compared and analyzed by Drăgan et al. [Dră+22]. An automated approach for architecture generation is proposed in Sun et al. [SMT23]. Different encoding techniques are discussed by Andrés et al. [ACN23]. Drawing a connection to a different context, the work by Park et al. [PPR20] proposes to vary the architecture itself by dynamically including and excluding two-qubit gates.

Recurrent Quantum Neural Networks. The work by S. Y.-C. Chen [Che23c] proposes the use of quantum recurrent neural networks (QRNNs) in the $Q$-learning setting (see Sec. 4.2.1), specifically quantum long short-term memory (QLSTM) [CYF22]. This enables the agent to also incorporate information from previous timesteps into the decision process. It is experimentally demonstrated on the CartPole environment that the QRNN is at least competitive with – if not superior to – purely classical models of similar size. It is also discussed that the method might be well suited for partially observable environments, establishing a connection to [Kim+21]. A continuation of this line of research in Ref. [Che23a] proposes a more efficient training routine for QRNNs, based on reservoir computing [LJ09] and the QA3C approach discussed in Ref. [Che23].

Explicit Architecture Comparison. A study by Drăgan et al. [Dră+22] compares various circuit architectures for a modified version of the FrozenLake environment. The underlying algorithm is a quantum version of PPO (see Sec. 4.2.2) and the VQCs are combined with classical NNs to a hybrid model. The results suggest that the performance is strongly dependent on the choice of VQC architecture. Measures like expressibility [SJA19], entanglement capability [SJA19], and effective dimension [Abb+21] provide an a priori indicator for the potential suitability of the architecture. However, there seems to be no clear correlation between the concrete value of these measures and the RL performance.

Continuous Environments and Encoding. The work by Kruse et al. [Kru+23] extends the actor-critic paradigm (discussed e.g. in Ref. [Dră+22]) to continuous action spaces. The authors demonstrate that the quantum agent is able to learn in the environments Pendulum-v1 and LunarLander-v2. It is conjectured that applying an arctan function to data points, as often done in the literature, is in fact counter-productive for the overall performance. Moreover, a stacked encoding is proposed, which uses angle encoding on multiple qubits for a single data point. This avoids pre-processing with a classical neural network, ensuring that potential performance improvements can really be attributed to the quantum agent. On both benchmarks a reduction in parameter complexity compared to classical agents is reported. However, this only holds true for certain design choices, which again highlights the importance of architecture selection.

Automatic Generation of Architectures. Sun et al. [SMT23] propose an automated tool for the generation of QRL-suitable circuit architectures. The method is based on differentiable quantum architecture search (DQAS) [Zha+22], i.e. the architecture itself is trained using gradient-based methods. The approach is studied within the framework of quantum $Q$-learning (see Sec. 4.2.1) on the FrozenLake environment. Using DQAS, the authors are able to identify a VQC architecture that seems to be very well-suited for the given problem and outperforms some typically used problem-agnostic circuit designs.

Encoding Considerations. The work by Andrés et al. [ACN23] compares different strategies for encoding data into the VQC, all within the context of quantum $Q$-learning (see Sec. 4.2.1). Evaluations are conducted on three environments within the energy-efficiency and management context. The authors compare three different architecture layouts: (1) classical data is pre-processed and reduced in dimensionality using a NN and encoded via rotational parameters; (2) similar, but data re-uploading [Pér+20] is employed; (3) classical data is normalized and encoded via amplitude encoding [SP18], and the output is post-processed with a NN. The authors claim superior performance compared to classical models of similar size, especially using amplitude encoding. However, it has to be noted that the experiments were quite small-scale. The combination with NNs complicates statements on the actual contribution of the quantum part. It also has to be noted that amplitude encoding might not be NISQ-compatible in the general case.

Variational quantum reinforcement learning via evolutionary optimization, Chen et al. (2022)


Summary. The main focus of the paper by Chen et al. [Che+22] is the investigation of gradient-free evolutionary optimization for $Q$-learning with VQCs. This routine is tested in two different scenarios, for each of which a state encoding scheme is also proposed. More concretely, amplitude encoding is applied to the CartPole environment. For the gridworld environment MiniGrid with a larger state space ($147$-dimensional), the paper proposes a hybrid model with an encoding mechanism based on tensor network (TN) techniques.

Amplitude Encoding. The observation space of the CartPole environment is $4$-dimensional. The state values are continuous. This allows the use of amplitude encoding, i.e. two qubits can be used to encode the (re-scaled) values into the four amplitudes of the system. The authors follow the method described in Schuld and Petruccione [SP18]. This works fine for small systems, but requires operations that are not NISQ-compatible for bigger instances.
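The re-scaling step behind amplitude encoding can be illustrated with a minimal NumPy sketch. It only computes the target amplitude vector; the state-preparation circuit itself is not shown. The function name and the example observation are hypothetical and not taken from Ref. [Che+22].

```python
import numpy as np

def amplitude_encode(observation):
    """Return the normalized amplitude vector for a real input of length 2**n."""
    x = np.asarray(observation, dtype=float)
    n_qubits = int(round(np.log2(x.size)))
    if 2 ** n_qubits != x.size:
        raise ValueError("input length must be a power of two")
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("the all-zero vector cannot be normalized")
    return x / norm

# Hypothetical CartPole observation: cart position, cart velocity, pole angle, pole velocity.
observation = [0.02, -0.35, 0.04, 0.51]
amplitudes = amplitude_encode(observation)   # amplitudes of |00>, |01>, |10>, |11>
probabilities = amplitudes ** 2              # Born-rule measurement probabilities
```

Note that the normalization discards the overall magnitude of the observation: two observations that differ only by a positive scale factor are mapped to the same quantum state.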

TN-based Encoding. The MiniGrid environment is similar to FrozenLake, as the goal in both environments is to navigate from a start to a goal state along the shortest possible path. The paper uses simple environment configurations, with state spaces of size $5\times 5$, $6\times 6$, and $8\times 8$. The observation space is of dimensionality $7\times 7\times 3$. The agent has to decide between $6$ actions, of which only $4$ are relevant in the simplified scenario. The reward is defined as $1-0.9\cdot\mathrm{number\_steps}/\mathrm{max\_number\_steps}$. Apart from the larger observation space, we assume this environment to be about the same complexity as FrozenLake.
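As a small worked example, the reward formula stated above can be written as a function. The function name and the step limit of 100 used in the example are assumptions made for illustration.

```python
def minigrid_reward(number_steps, max_number_steps):
    """Reward 1 - 0.9 * number_steps / max_number_steps, as stated in the text."""
    return 1.0 - 0.9 * number_steps / max_number_steps

# With an assumed limit of 100 steps, faster episodes are rewarded more.
fast = minigrid_reward(10, 100)
slow = minigrid_reward(90, 100)
```

The formula therefore decreases linearly in the number of steps, from $1$ for an immediate success towards $0.1$ when the full step budget is used.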

The paper addresses the problem of encoding the $147$-dimensional state into a quantum feature map with just $8$ variational parameters. Other work uses e.g. CNNs to reduce the dimensionality of the feature space [LS21]. As the encoding networks have to be pre-trained, it is not quite clear what part of the work is really done by the VQC. The authors suggest using a hybrid encoding scheme based on TNs, similar to Chen et al. [Che+21]. The proposed TN technique encodes the observation $[v_{1},\cdots,v_{147}]^{t}$ into the product state $[1-v_{1},v_{1}]^{t}\otimes[1-v_{2},v_{2}]^{t}\otimes\cdots\otimes[1-v_{N},v_{N}]^{t}$, where the individual elements are normalized. Those encoded states are represented by the red nodes in Fig. 15a. The trainable part of the matrix product state (MPS) outputs an $8$-dimensional compressed feature vector. This is represented by the $147+1$ blue nodes and the open leg (i.e. outgoing edge) in Fig. 15a. The bond dimension is a hyperparameter of the MPS, which correlates with the number of trainable parameters [Per+06].
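The following NumPy sketch illustrates how such an MPS can compress the product-state feature map into an $8$-dimensional vector. It is not the implementation of Ref. [Che+22]. The following choices are assumptions made for illustration: the bond dimension, the near-identity initialization, the uniform boundary vectors, and the placement of the tensor with the open leg in the middle of the chain. Inputs are assumed to lie in $[0,1]$.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 147            # flattened 7 x 7 x 3 observation
BOND_DIM = 4                # hyperparameter, value chosen arbitrarily here
OUT_DIM = 8                 # size of the compressed feature vector
OUT_POS = N_FEATURES // 2   # assumed position of the tensor carrying the open leg

def feature_map(v):
    """Map each scalar v_j in [0, 1] to the local vector [1 - v_j, v_j]."""
    v = np.asarray(v, dtype=float)
    return np.stack([1.0 - v, v], axis=1)          # shape (N, 2)

def init_mps():
    """147 site tensors of shape (bond, 2, bond) plus one output tensor."""
    eye = np.eye(BOND_DIM)[:, None, :]
    sites = eye + 0.01 * rng.standard_normal((N_FEATURES, BOND_DIM, 2, BOND_DIM))
    out = 0.1 * rng.standard_normal((BOND_DIM, OUT_DIM, BOND_DIM))
    return sites, out

def compress(v, sites, out):
    """Contract the product state with the MPS and return the open leg."""
    phi = feature_map(v)
    # Contract every physical leg with its local feature vector.
    mats = np.einsum("np,napb->nab", phi, sites)   # shape (N, bond, bond)
    left = np.ones(BOND_DIM) / np.sqrt(BOND_DIM)
    for m in mats[:OUT_POS]:
        left = left @ m
    right = np.ones(BOND_DIM) / np.sqrt(BOND_DIM)
    for m in mats[OUT_POS:][::-1]:
        right = m @ right
    return np.einsum("a,aob,b->o", left, out, right)

sites, out = init_mps()
v = rng.random(N_FEATURES)            # hypothetical normalized observation
features = compress(v, sites, out)    # 8 values that would parametrize the VQC
```

In this sketch the number of trainable parameters is $147\cdot 2\cdot\chi^{2}+8\cdot\chi^{2}$ for bond dimension $\chi$, which illustrates how the bond dimension controls the model size.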

(a) TN for performing dimensionality reduction;
(b) VQC with feature map, several variational layers, and $1$-qubit measurements;
Figure 15: Components of the architecture proposed by and taken from Chen et al. [Che+22].

VQC Architecture. The model follows the typical three-part architecture, i.e. first the feature map, then the variational part, and finally some measurements. For the CartPole environment, a simple $2$-qubit circuit with amplitude encoding and $4$ variational layers is used. Both qubits are measured in the Pauli-$Z$ basis and the action corresponding to the higher expectation value is selected. For the MiniGrid environment, an $8$-qubit circuit with just one repetition of the variational layer is used. The encoding is done with the TN-compressed state, i.e. the output from the TN is encoded into the circuit as shown in Fig. 15b. As the environment has $6$ actions, the top $6$ qubits are measured, and the action corresponding to the highest expectation value is executed.
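The measurement and action-selection step can be sketched for a noiseless statevector as follows. The variational circuit itself is omitted; the input is simply assumed to be the final state. The qubit ordering (qubit $0$ as the most significant bit) and the function names are assumptions made for illustration.

```python
import numpy as np

def z_expectations(state):
    """Return <Z_q> for every qubit q of a pure state given as an amplitude vector."""
    state = np.asarray(state, dtype=complex)
    n_qubits = int(round(np.log2(state.size)))
    probs = (np.abs(state) ** 2).reshape((2,) * n_qubits)
    expectations = []
    for q in range(n_qubits):
        other_axes = tuple(ax for ax in range(n_qubits) if ax != q)
        marginal = probs.sum(axis=other_axes)      # [P(q = 0), P(q = 1)]
        expectations.append(marginal[0] - marginal[1])
    return np.array(expectations)

def select_action(state, n_actions):
    """Pick the action whose qubit has the highest Pauli-Z expectation value."""
    return int(np.argmax(z_expectations(state)[:n_actions]))

# Example: the basis state |01> gives <Z_0> = +1 and <Z_1> = -1, so action 0 is chosen.
example_state = np.array([0.0, 1.0, 0.0, 0.0])
example_action = select_action(example_state, n_actions=2)
```

On hardware the expectation values would have to be estimated from a finite number of shots, which this exact computation does not model.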

RL with Evolutionary Optimization. The underlying algorithm is a $Q$-learning RL approach. The updates of the QNN representing the action-value function are conducted via evolutionary optimization. This implies that no gradients have to be computed. Gradient computation is usually one major bottleneck of VQC-based RL, which might be circumvented by this approach.

The paper uses a simplistic instance of an evolutionary algorithm, where mutation, but no recombination operations are employed. An initial population of $M$ individuals is generated, which are used to simulate some episodes on the environment. The best $T$ agents (the ones producing the highest reward averaged over several runs) are selected as parents for the next generation. Random Gaussian noise is applied to these parents (mutation), until $M-1$ children are generated. Additionally, the best individual from the previous generation is kept, i.e. there are again $M$ individuals. This procedure is repeated until a certain convergence criterion is met, e.g. a high enough reward.
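The mutation-only scheme described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the fitness function is a hypothetical stand-in for the averaged episodic return, parents are picked uniformly at random, and all hyperparameter values are chosen arbitrarily.

```python
import numpy as np

def evolve(fitness, n_params, population_size=20, n_parents=5,
           sigma=0.1, generations=50, seed=0):
    """Mutation-only evolutionary search that keeps the best individual (elitism)."""
    rng = np.random.default_rng(seed)
    population = rng.normal(0.0, 1.0, size=(population_size, n_params))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in population])
        order = np.argsort(scores)[::-1]               # best individual first
        elite = population[order[0]].copy()
        parents = population[order[:n_parents]]        # the T best individuals
        # M - 1 children: pick a random parent and add Gaussian noise (mutation).
        picks = rng.integers(0, n_parents, size=population_size - 1)
        noise = sigma * rng.standard_normal((population_size - 1, n_params))
        children = parents[picks] + noise
        population = np.vstack([elite[None, :], children])  # again M individuals
    scores = np.array([fitness(ind) for ind in population])
    return population[int(np.argmax(scores))], float(scores.max())

# Hypothetical toy fitness: negative squared distance to a fixed target vector.
target = np.array([0.5, -1.0, 2.0])

def toy_fitness(theta):
    return -float(np.sum((theta - target) ** 2))

best_params, best_score = evolve(toy_fitness, n_params=3)
```

With a deterministic fitness function, keeping the elite guarantees that the best score never decreases between generations. For noisy episodic returns this guarantee no longer holds, which is why averaging over several runs is used in the paper.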

Experimental Findings and Discussion. The paper applies the two different encoding methods, combined with the evolutionary optimization idea, to the respective environments. All experiments are conducted as noiseless simulations. On the CartPole environment, the $2$-qubit architecture achieves a near-optimal performance with only $26$ parameters, which is significantly fewer than in most state-of-the-art NNs. The authors claim that with their method the number of parameters can be reduced to $\mathcal{O}(\mathrm{polylog}(n))$. In contrast, classical ML requires $\mathcal{O}(\mathrm{poly}(n))$ parameters.

The experiments on the MiniGrid environment employ the described hybrid TN-based architecture. Results are compared to an encoding based on a simple NN, presumably similar to Lockwood and Si [LS21]. All approaches achieve a near-optimal performance. Overall, the TN-model (with large enough bond dimension) slightly outperforms the classical approach. The authors consider this a proof-of-principle for the effectiveness of the MPS encoding for RL.

Remarks. The amplitude encoding is currently not feasible for more complex problems, due to the lack of a NISQ-compatible state-preparation routine. The evolutionary optimization approach could circumvent some of the problems typically associated with gradient-based techniques. Experiments on larger-scale environments might be an interesting direction for future work, to investigate how the evolutionary algorithm deals with more complex optimization landscapes. We suggest incorporating some recombination procedures into the evolutionary algorithm to enhance its performance.

Multi-Objective Formulation. Related work by Ding and Spector [DS23] proposes a version of evolutionary search for the automated generation of QRL architectures (see also the discussions on VQC architecture above in Ref. [Hsi+22] and related work). The training itself is done with a QPG approach [Jer+21] (see Sec. 4.2.2) and nested with evolutionary architecture search [DS22]. This procedure is conducted w.r.t. several objectives, including enforcing a model size that is as small as possible and several noise-related considerations. The approach is validated on the three benchmark environments CartPole, MountainCar, and Acrobot. The results demonstrate improved training behavior – with smaller model size – compared to previous work [Jer+21]. The authors also further analyze the learned architectures for recurring patterns. However, it is acknowledged that larger-scale experiments are necessary to identify a general guideline for architecture selection.

Multi-Agent Scenario. The work by Kölle et al. [Köl+23] extends the framework of Ref. [Che+22] to the multi-agent setting (see Sec. 4.2.3). The authors compare different evolutionary strategies, including mutation-only and two different setups with additional recombination steps. The evaluation is conducted on the CoinGame environment and yields results that are competitive with classical approaches – using significantly fewer parameters. It has to be noted that the experiments are too small-scale to make reliable statements about the scaling behaviour of this approach. While evolutionary optimization is certainly an interesting alternative to gradient-based techniques, the stated advantage regarding reduced proneness to barren plateaus is not sufficiently documented and should therefore be viewed with some scepticism.

4.2.6 Application-Focused Work

This section summarizes work that discusses VQC-based QRL techniques for specific applications. On the one hand, this is a very important area of research, as it may one day identify practically relevant uses of QRL. On the other hand, it has to be noted that all current work is limited to relatively small problem setups. This can be justified by current hardware restrictions – but also casts some doubt on the scalability of the stated results. Nonetheless, an overview of the considered ideas might be beneficial for further research:

Applications related to robotics and similar control tasks are discussed in Refs. [Acu+22, Hei+22, Cob23, BYK22, SMK23, Hic+23, KCP23]. Planning tasks of different form are the focus of Refs. [Cor+23, San+22, ACN22, Liu+23, Kum+23, Rai+23, SH23, RKM22]. Collaborative environments are addressed with multi-agent methods in Refs. [Yan+22, Par+23, NS+23, Par+23a, PK23, Yun+23, Ans+23]. The field of finances is discussed in Refs. [Che+23b, Yan23]. A back-to-the-roots work considers QRL for board games in Ref. [CRC23]. Last but not least, the task of designing VQC architectures is addressed in Ref. [Che23d].

| Citation | First Author | Title |
|---|---|---|
| [Acu+22] | A. Acuto | Variational Quantum Soft Actor-Critic for Robotic Arm Control |
| [Hei+22] | D. Heimann | Quantum Deep Reinforcement Learning for Robot Navigation Tasks |
| [Cob23] | J. Cobussen | Quantum Reinforcement Learning for Sensor-Assisted Robot Navigation Tasks |
| [BYK22] | N. F. Bar | An Approach Based on Quantum Reinforcement Learning for Navigation Problems |
| [SMK23] | A. Sinha | Nav-Q: Quantum Deep Reinforcement Learning for Collision-Free Navigation of Self-Driving Cars |
| [Hic+23] | M. L. Hickmann | Potential analysis of a Quantum RL controller in the context of autonomous driving |
| [KCP23] | G. S. Kim | Realizing Stabilized Landing for Computation-Limited Reusable Rockets: A Quantum Reinforcement Learning Approach |
| [Cor+23] | R. Correll | Quantum Neural Networks for a Supply Chain Logistics Application |
| [San+22] | F. Sanches | Short quantum circuits in reinforcement learning policies for the vehicle routing problem |
| [ACN22] | E. Andrés | On the Use of Quantum Reinforcement Learning in Energy-Efficiency Scenarios |
| [Liu+23] | D. Liu | Multi-agent quantum-inspired deep reinforcement learning for real-time distributed generation control of 100% renewable energy systems |
| [Kum+23] | M. Kumar | Blockchain Based Optimized Energy Trading for E-Mobility Using Quantum Reinforcement Learning |
| [Rai+23] | S. Rainjonneau | Quantum Algorithms applied to Satellite Mission Planning for Earth Observation |
| [SH23] | M. Shahid | Introducing Quantum Variational Circuit for Efficient Management of Common Pool Resources |
| [RKM22] | F. Rezazadeh | Towards Quantum-Enabled 6G Slicing |

Table 25: [Part 1] Work considered for “QRL with VQCs – Application-Focused Work” (Sec. 4.2.6)

| Citation | First Author | Title |
|---|---|---|
| [Yan+22] | R. Yan | A Multiagent Quantum Deep Reinforcement Learning Method for Distributed Frequency Control of Islanded Microgrids |
| [Par+23] | S. Park | Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV System |
| [NS+23] | B. Narottama | Layerwise Quantum Deep Reinforcement Learning for Joint Optimization of UAV Trajectory and Resource Allocation |
| [Par+23a] | S. Park | Quantum Multi-Agent Reinforcement Learning for Autonomous Mobility Cooperation |
| [PK23] | S. Park | Quantum Reinforcement Learning for Large-Scale Multi-Agent Decision-Making in Autonomous Aerial Networks |
| [Yun+23] | W. J. Yun | Quantum Multi-Agent Actor-Critic Neural Networks for Internet-Connected Multi-Robot Coordination in Smart Factory Management |
| [Ans+23] | J. A. Ansere | Quantum Deep Reinforcement Learning for Dynamic Resource Allocation in Mobile Edge Computing-based IoT Systems |
| [Che+23b] | E. A. Cherrat | Quantum Deep Hedging |
| [Yan23] | J. Yang | Apply Deep Reinforcement Learning with Quantum Computing on the Pricing of American Options |
| [CRC23] | J. Chao | Quantum Enhancements for AlphaZero |
| [Che23d] | S. Y.-C. Chen | Quantum Reinforcement Learning for Quantum Architecture Search |

Table 26: [Part 2] Work considered for “QRL with VQCs – Application-Focused Work” (Sec. 4.2.6)

QRL for Robotics and other Control Tasks


The work by Acuto et al. [Acu+22] applies the quantum SAC approach proposed in Ref. [Lan21] to the control of a robotic arm. The environment is implemented as an extension of the Acrobot-v1 environment. On this small-scale setup the hybrid quantum model demonstrates reduced parameter complexity compared to classical methods.

A robot navigation scenario is discussed by Heimann et al. [Hei+22] in a simulated environment. The quantum Q-learning (see Sec. 4.2.1) approach demonstrates parameter reduction compared to classical approaches. The setup is extended to a more complex environment by J. Cobussen [Cob23].

A similar robot navigation task is considered in Bar et al. [BYK22], which employs the Q-learning method proposed in Ref. [Che+20]. The authors report a reduction in the number of parameters, which however also yields a decreased success rate for the considered scenarios.

Collision-free navigation of self-driving cars is considered in Sinha et al. [SMK23]. The authors employ an actor-critic quantum A2C approach, which is similar to the QA3C introduced by Ref. [Che23]. On a small $4$-qubit toy environment the proposed approach shows improved training stability compared to classical A2C. A similar problem is considered with tools from quantum $Q$-learning (see Sec. 4.2.1) by Hickmann et al. [Hic+23].

The task of steering reusable rockets is considered in Kim et al. [KCP23]. The unspecified QRL method demonstrates reduced memory requirements (by requiring fewer parameters) on an $8$-qubit toy environment.

QRL for Planning Tasks


The vehicle routing problem (VRP) is considered by Correll et al. [Cor+23] via a quantum-enhanced attention mechanism. Several parts of a classical encoder-decoder model with attention mechanism [KVW18] are replaced with medium-scale VQCs (up to $10$ qubits). By using quantum methods to implement orthogonal NNs [KLM21], a potential speed-up during inference is reported. Experiments on a simple instance of the traveling salesman problem (TSP) are conducted to support this claim. A simpler approach for the same task is considered in Sanches et al. [San+22], where only the attention heads are replaced with $4$-qubit VQCs.

The work by Andrés et al. [ACN22] considers different planning tasks related to energy-efficiency scenarios. The authors employ quantum actor-critic methods (see Sec. 4.2.2) to address these tasks. They report a slower convergence compared to classical methods, but in exchange a reduced parameter complexity. Similar scenarios within the energy context are also discussed by Liu et al. [Liu+23] and Kumar et al. [Kum+23].

The task of satellite mission planning is formulated as a scheduling problem and addressed by Rainjonneau et al. [Rai+23]. The authors apply two different quantum-enhanced methods within this context: (1) policy approximation (see Sec. 4.2.2) with VQCs; (2) replacing several components of AlphaZero with quantum components, similar to what is discussed in Ref. [CRC23]. The experiments with $4$-qubit circuits demonstrate a clear improvement compared to straightforward greedy methods.

The problem of distributing common pool resources is discussed by Shahid and Hassan [SH23]. Quantum-enhanced $Q$-learning (see Sec. 4.2.1) is applied to an $8$-qubit toy environment, and superior training performance compared to classical models of similar size is reported.

A task from mobile communication (6G slicing) is considered in Rezazadeh et al. [RKM22]. The authors employ the VQC-based $Q$-learning approach proposed in Ref. [Che+20] and claim improvements w.r.t. parameter complexity and the potential for distributed computing.

QRL in Collaborative Scenarios


Different tasks that are based on the collaboration of multiple entities are discussed in a series of works by Yan et al. [Yan+22], Park et al. [Par+23, Par+23a, PK23], Yun et al. [Yun+23], Narottama et al. [NS+23], and Ansere et al. [Ans+23]. The foundation is the multi-agent approach QMARL proposed in Ref. [Yun+22], with smaller extensions. On the respective toy environments, the approaches demonstrate faster convergence and reduced parameter complexity compared to classical implementations.

QRL for Finances


The work by Cherrat et al. [Che+23b] addresses the task of deep hedging with distributional actor-critic methods. Classical methods are modified with quantum-enhanced orthogonal NNs [KLM21], which promises speed-ups during inference. This is supported by medium-scale hardware tests on up to $16$ qubits – which makes this one of the largest-scale demonstrations of VQC-based QRL.

Another work within the context of finances, conducted by J. Yang [Yan23], proposes the use of quantum Q-learning (see Sec. 4.2.1) to speed up calculations.

QRL for Games


The work by Chao et al. [CRC23] thinks back to the origins of classical RL and considers the board game Othello, which basically is a simplified version of Go. To solve this toy environment, the authors modify two components of AlphaZero [Sil+18]: (1) replacing function approximators with VQCs; (2) using tensor network methods for feature extraction. For simulations on up to $12$ qubits, the methods show performance comparable to classical approaches.

QRL for Architecture Design


S. Y.-C. Chen [Che23d] addresses the task of quantum circuit design. The author uses the actor-critic method QA3C [Che23] to generate circuits that prepare $2$-qubit Bell states and GHZ states on up to $3$ qubits.

4.3 Projective Simulation for Quantum Reinforcement Learning

| Citation | First Author | Title |
|---|---|---|
| [BD12] | H. J. Briegel | Projective simulation for artificial intelligence |
| [Mel+17] | A. A. Melnikov | Projective simulation with generalization |
| [Boy+20] | W. L. Boyajian | On the convergence of projective-simulation–based reinforcement learning in Markov decision processes |
| [Pap+14] | G. D. Paparo | Quantum Speedup for Active Learning Agents |
| [Tei21] | M. Teixeira | Quantum Reinforcement Learning Applied to Games |
| [TRC21] | M. Teixeira | Quantum Reinforcement Learning Applied to Board Games |
| [DFB15] | V. Dunjko | Quantum-enhanced deliberation of learning agents using trapped ions |
| [Sri+18] | T. Sriarunothai | Speeding-up the decision making of a learning agent using an ion trap quantum processor |
| [Fla+23] | F. Flamini | Towards interpretable quantum machine learning via single-photon quantum walks |

Table 27: Work considered for “Projective Simulation for QRL” (Sec. 4.3)

Projective simulation for artificial intelligence, Briegel et al. (2012) and related work


Summary. Projective simulation for artificial intelligence by Briegel et al. [BD12] is the first in a series of articles, which propose a learning scheme for creative behavior. This is understood in the sense that the agent can deal with unseen experiences by relating to other conceivable situations. The method is developed for classical agents. There is only a brief final paragraph, outlining a quantum-mechanical implementation. Since subsequent papers ‘quantize’ the original idea heavily, a brief summary is in order: The approach is based on a random walk on a previous-experience network (memory), simulating an agent pondering its next action. More specifically, previous experiences compose a network of clips, which is dynamically modified by new experiences. It is important to note that clips, in contrast to actual experiences, are e.g. remembered observations, states or actions. To select the next action, an observation of the agent activates a clip, followed by a random walk through the network (projective simulation). This is repeated until an action is ‘excited’ and coupled out from the network and the action is selected. It is worthwhile noting that the term projective as used here is not related to its use in quantum physics, such as in projective measurement.

Action Selection. The process of action selection is slightly more sophisticated than described above. If a percept $s$ is observed, a random walk through the network starts from the corresponding percept clip. After some deliberation time the random walk reaches an action clip, which is only out-coupled and taken in reality if the percept-action pair $(s,a)$ was rewarded in the past (i.e. tagged positively). If not, a new simulation is started. This process repeats until an action clip with positive tag, or a predefined reflection time is reached; in the latter case the action is out-coupled irrespective of the tag.
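The selection loop with reflection can be sketched in plain Python. The sketch assumes that the hopping probability from one clip to the next is proportional to the edge weight, and that the clip network is stored as a nested dictionary. The names and the tiny example network are hypothetical and not taken from Ref. [BD12].

```python
import random

def random_walk(h, percept, actions, rng):
    """Hop along weighted edges, starting at the percept clip, until an action clip is hit."""
    clip = percept
    while clip not in actions:
        neighbors = list(h[clip].keys())
        weights = [h[clip][n] for n in neighbors]
        clip = rng.choices(neighbors, weights=weights, k=1)[0]
    return clip

def select_action(h, percept, actions, positive_tags, reflection_time, rng):
    """Repeat the simulation until a positively tagged action is found or time runs out."""
    if reflection_time < 1:
        raise ValueError("reflection_time must be at least 1")
    for _ in range(reflection_time):
        action = random_walk(h, percept, actions, rng)
        if (percept, action) in positive_tags:
            return action
    return action  # reflection time reached: out-couple irrespective of the tag

# Hypothetical clip network: one percept clip, one intermediate clip, two action clips.
h = {"s1": {"c1": 1.0, "a_left": 1.0},
     "c1": {"a_left": 1.0, "a_right": 3.0}}
actions = {"a_left", "a_right"}
positive_tags = {("s1", "a_right")}
rng = random.Random(0)
chosen = select_action(h, "s1", actions, positive_tags, reflection_time=1000, rng=rng)
```

In this example a single walk ends in the tagged action with probability $0.5\cdot 0.75=0.375$, so a larger reflection time makes it increasingly likely that the previously rewarded action is out-coupled.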

Learning Procedure. The actual learning process can be summarized as follows:

  1. If a transition $(s,a)$ is rewarded, increase the network weight of the direct transition $s\rightarrow a$. (Note that the agent might have chosen $s\rightarrow a$ after many steps of PS; by reinforcing the direct transition, it might be exploited directly next time);

  2. Increase the weights of the indirect transition (all weights of the network that led to the transition $s\rightarrow a$ in the random walk through the network). Thus, the agent discovers useful actions after deliberation of fictitious clips;

  3. Introduce damping of all weights to let the agent forget, in order to be able to adapt to new situations (as appearing for example in a time-dependent environment);

  4. If a new situation is discovered, a corresponding clip is added to the network and directed edges from all the other clips to the new one are added;

  5. Additional extensions can be implemented, such as modifications of clips and creation of completely fictitious compositions of episodes.
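Steps 1 to 3 of the learning procedure can be sketched as a single weight update. The concrete rule used here, damping towards an initial weight of $1$ followed by an additive reward, is an assumption for illustration; the summary above only describes the procedure qualitatively. The parameter names gamma and bonus are likewise hypothetical, and steps 4 and 5 are not covered.

```python
def update_weights(h, walked_edges, percept, action, reward, gamma=0.05, bonus=1.0):
    """Update clip-network weights h[src][dst] after one interaction."""
    # Step 3: damping pulls every weight back towards its initial value 1 (forgetting).
    for src in h:
        for dst in h[src]:
            h[src][dst] -= gamma * (h[src][dst] - 1.0)
    if reward > 0:
        # Step 1: strengthen (or create) the direct transition percept -> action.
        h.setdefault(percept, {}).setdefault(action, 1.0)
        h[percept][action] += reward
        # Step 2: strengthen the indirect transitions traversed during the walk.
        for src, dst in walked_edges:
            if (src, dst) != (percept, action):
                h[src][dst] += bonus * reward
    return h

# Hypothetical network and walk s1 -> c1 -> a_right that was rewarded.
h = {"s1": {"c1": 1.0, "a_left": 1.0},
     "c1": {"a_left": 1.0, "a_right": 3.0}}
walk = [("s1", "c1"), ("c1", "a_right")]
update_weights(h, walk, percept="s1", action="a_right", reward=1.0, gamma=0.5)
```

After this update a direct edge from the percept to the rewarded action exists, so the action can be reached in a single hop the next time the percept is observed.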

This line of research has been continued in Ref. [Mel+17] (generalization) and [Boy+20] (convergence). In the last paragraph of Ref. [BD12] a quantum version of the algorithm is briefly discussed. The idea is to replace the random walk on the network by a quantum walk. A number of subsequent papers investigate the quantum approach more rigorously:

In order to define a quantum walk algorithm as done in Ref. [Pap+14], the PS approach is viewed slightly differently. The given clip network with the percept set $S$ is separated into $|S|$ disjoint networks. Thus one obtains a directed weighted graph (a Markov chain) for each percept with action clips as absorber states. Each of the actions is initially flagged (corresponding to the emotion tags of the original projective simulation proposal). If an actual out-coupled action did not lead to a reward, this particular flag is removed. Now the action selection proceeds in the following way: If the agent observes a percept $s$, a random walk starts through the graph (deliberation) until an action is reached, which is out-coupled only if the action is flagged (reflection). Thus, action selection corresponds to sampling from the conditional probability distribution over the flagged action space. Given the transition matrix $P$ of the Markov chain, subsequent applications of $P$ to the initial state (probability one for the percept clip) realize the approximate stationary distribution (subsequently referred to as diffusion). Sampling from this distribution and disregarding un-flagged actions produces the correct samples. As $p_{s}$ is the probability to sample a flagged action from the equilibrium distribution obtained by diffusion, one needs to repeat the sampling process $\mathcal{O}(1/p_{s})$ times until a flagged action is sampled. The quantum random walk search algorithm is closely related to Grover’s algorithm. By elevating the transition matrix to a diffusion operator and introducing an oracle that marks flagged actions, the quantum algorithm only needs $\mathcal{O}(1/\sqrt{p_{s}})$ oracle calls. Consequently, a quadratic speed-up for the deliberation process can be achieved. Therefore, this quantum algorithm speeds up the agent’s internal computation time for action selection. This technique is extended and applied to board games in Refs. [Tei21, TRC21].
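The difference between the two scalings can be made concrete with the textbook success probability of Grover-type amplitude amplification, $\sin^{2}((2k+1)\theta)$ with $\sin\theta=\sqrt{p_{s}}$ after $k$ iterations. This is a generic illustration and not the quantum-walk construction of Ref. [Pap+14]; the function names are hypothetical.

```python
import math

def classical_expected_samples(p_s):
    """Expected number of classical samples until a flagged action is drawn."""
    return 1.0 / p_s

def grover_iterations(p_s):
    """Iteration count floor(pi / (4 * theta)) with sin(theta) = sqrt(p_s)."""
    theta = math.asin(math.sqrt(p_s))
    return max(0, math.floor(math.pi / (4.0 * theta)))

def success_probability(p_s, k):
    """Probability of measuring a flagged action after k amplification steps."""
    theta = math.asin(math.sqrt(p_s))
    return math.sin((2 * k + 1) * theta) ** 2

p_s = 0.01
n_classical = classical_expected_samples(p_s)   # scales as 1 / p_s
n_quantum = grover_iterations(p_s)              # scales as 1 / sqrt(p_s)
p_success = success_probability(p_s, n_quantum)
```

For $p_{s}=0.01$ the classical procedure needs about $100$ samples on average, whereas $7$ amplification steps already yield a flagged action with probability above $0.99$.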

Experimental Implementation. In Ref. [DFB15] the authors investigate the implementation of the algorithm proposed by Ref. [Pap+14] on an ion trap quantum computer. Results are also backed up by numerical simulations. The actual proof-of-principle experiment with two qubits is discussed in Ref. [Sri+18], where signatures of the quadratic speed-up are observed. Ref. [Fla+23] proposes a quantum-optics-based implementation of the projective simulation paradigm. Here, the random walk through the clip network is promoted to a quantum walk of a single photon through an optical interferometer. Outcoupling of an action then corresponds to an occupation number measurement of output modes.

4.4 Boltzmann Machines for Quantum Reinforcement Learning

| Citation | First Author | Title |
|---|---|---|
| [Jer+21a] | S. Jerbi | Quantum Enhancements for Deep Reinforcement Learning in Large Spaces |
| [Cra+18] | D. Crawford | Reinforcement Learning Using Quantum Boltzmann Machines |
| [Sch+22] | M. Schenk | Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines |
| [Lev+17] | A. Levit | Free energy-based reinforcement learning using a quantum processor |

Table 28: Work considered for “Boltzmann Machines for QRL” (Sec. 4.4)

Quantum Enhancements for Deep Reinforcement Learning in Large Spaces, Jerbi et al. (2021) and related work


Summary. The work presented in Ref. [Jer+21a] investigates an alternative NN architecture to those often used for learning the $Q$-function (or more generally the merit function) in RL tasks. The authors argue that these alternative models perform advantageously in large action spaces. This is due to their capability to represent multimodal functions better than standard network architectures, while using a similar number of parameters. It is further found that these alternative architectures are closely related to energy-based models, some of which admit quantum representations. In turn, this allows quantum evaluations, enabling a provable quantum speed-up for fault-tolerant quantum computing.

Motivation. The standard architecture for $Q$-learning with NNs is depicted in Fig. 16 (upper part). The representation of a state is fed into a NN, which outputs the values of the so-called merit function (the $Q$-value in case of $Q$-learning) for each possible action (given the state). The policy can be derived from this function by softmax post-processing. The effective-temperature parameter is decreased over time to reduce exploration and enhance exploitation.

The authors argue that this NN architecture is not suited for large action spaces. It has to output a high-dimensional function, i.e. the merit functions for all actions simultaneously for a given state. The authors argue that this network is unable to approximate a multimodal merit function in case of complex state-action correlations. Instead, the authors discuss the NN structure shown in the lower part of Fig. 16. Here, the state and action are fed to the NN, which outputs the corresponding merit function. Action selection is done by sampling from the probability distribution given by a softmax function on the values of the merit function. Therefore, sampling requires $|A|$ forward passes, where $A$ is the action set, making action selection a computationally expensive task for large action spaces.
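The cost of action selection in the second architecture can be sketched as follows. The merit function below is a hypothetical stand-in for a trained network with scalar output, and the inverse temperature beta is an illustrative parameter name; neither is taken from Ref. [Jer+21a].

```python
import numpy as np

def softmax_policy(merits, beta):
    """Softmax over merit values with inverse temperature beta."""
    z = beta * (merits - np.max(merits))   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_action(merit_fn, state, actions, beta, rng):
    """Sample an action; requires one forward pass of merit_fn per action."""
    merits = np.array([merit_fn(state, a) for a in actions])   # |A| evaluations
    probs = softmax_policy(merits, beta)
    return actions[rng.choice(len(actions), p=probs)], probs

def toy_merit(state, action):
    """Hypothetical merit function that prefers actions close to the state value."""
    return -abs(state - action)

rng = np.random.default_rng(0)
actions = list(range(5))
action, probs = sample_action(toy_merit, state=2, actions=actions, beta=1.0, rng=rng)
```

Increasing beta, i.e. lowering the effective temperature, concentrates the distribution on the action with the highest merit value, which corresponds to the shift from exploration to exploitation mentioned above.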

Experiments are conducted on a generalized GridWorld environment with a large set of actions. The associated complex transition function gives rise to one optimal and many sub-optimal policies. The authors find that the NN architecture shown in the lower part of Fig. 16 indeed performs better, but at the cost of the expensive sampling described before.

Energy-based Models. The potential for quantum speed-up comes from the observation that the second architecture in Fig. 16 is equivalent to a certain kind of energy-based model. Energy-based function approximators are used for generative modeling of probability distributions based on the Boltzmann-Gibbs distribution with respect to an energy functional. Boltzmann machines are one instance of such energy-based models, where the energy functional is given by a spin-spin interaction model. However, Boltzmann machines are hard to train, which led to the development of restricted Boltzmann machines, where a special interaction structure with a hidden layer enables more efficient training. In Ref. [Jer+21a] the authors observe that the lower architecture in Fig. 16 is equivalent to a generalized form of restricted Boltzmann machines.
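For a standard restricted Boltzmann machine with binary units, summing out the hidden layer yields a closed-form free energy of the visible configuration. The sketch below checks this closed form against brute-force enumeration. It uses the textbook binary RBM rather than the generalized form discussed in Ref. [Jer+21a], and all weights and the visible configuration are random placeholders.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """RBM energy E(v, h) = -a.v - b.h - v^T W h for binary units."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def free_energy(v, W, a, b):
    """Closed form F(v) = -a.v - sum_j log(1 + exp(b_j + (v^T W)_j))."""
    return -(a @ v) - np.sum(np.logaddexp(0.0, b + v @ W))

def free_energy_brute_force(v, W, a, b):
    """F(v) = -log sum_h exp(-E(v, h)), enumerating all hidden configurations."""
    energies = [energy(v, np.array(h, dtype=float), W, a, b)
                for h in itertools.product([0, 1], repeat=b.size)]
    return -np.log(np.sum(np.exp(-np.array(energies))))

rng = np.random.default_rng(0)
n_visible, n_hidden = 5, 3
W = rng.normal(size=(n_visible, n_hidden))
a = rng.normal(size=n_visible)
b = rng.normal(size=n_hidden)
v = np.array([1.0, 0.0, 1.0, 0.0, 1.0])   # e.g. an encoded state-action pair

f_closed = free_energy(v, W, a, b)
f_exact = free_energy_brute_force(v, W, a, b)
```

The closed form is what makes a single evaluation cheap. Sampling an action according to the Boltzmann-Gibbs distribution still requires the energies of all candidate actions, which is the bottleneck the quantum subroutines aim at.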

Figure 16: Difference between architectures in $Q$-learning (upper part of the figure) and energy-based models (lower part of the figure), as shown in Jerbi et al. [Jer+21a].

Quantum Speed-Up. Inspired by this insight, the authors next investigate quantum energy-based models. Here, the classical spin-spin interaction energy is promoted to a spin Hamiltonian, which yields models known as quantum Boltzmann machines and restricted quantum Boltzmann machines. Some of these models allow efficient training, while the hardness of sampling remains. To speed up sampling in the classical and quantum setting, the following quantum subroutines for an RL algorithm are discussed:

(1) Quantum Gibbs sampling: The Gibbs-Boltzmann distribution is prepared as a qsample, from which expectation values can be estimated with a quadratic speed-up compared to classical Monte-Carlo sampling methods. (2) Gibbs-state preparation by Hamiltonian simulation: Using Hamiltonian-simulation techniques, an approximation to the Gibbs qsample can be prepared, leading to a quadratic speed-up compared to exact sampling (calculating all energies and explicitly normalizing the probability distribution). (3) Quantum simulated annealing: This method uses a quantum method for the approximate Monte-Carlo sampling of the Gibbs state itself by leveraging quantum random walks on graphs.

All methods discussed so far need oracularized access to the Hamiltonian, and it is unlikely that they could be realized on current hardware. A realization on near-term hardware might be achieved by (4) Variational Gibbs-state preparation: Here, a variational circuit can be employed to approximate a Gibbs qsample, using the free energy as an objective. Any quantum speed-up for this method, however, is heuristic and has not been made rigorous so far.

Remarks. Related work [Cra+18, Sch+22, Lev+17] proposes models based on quantum Boltzmann machines for quantum annealing hardware. Since this literature survey focuses on algorithms proposed for gate-based QC, we do not include a detailed summary here.

4.5 Quantum Policy and Value Iteration

So far, we have considered QRL algorithms that employ QC for function approximation or propose quantum approaches to alternative learning frameworks such as PS. We now turn to proposals that replace subroutines of existing RL frameworks by quantum algorithms such as amplitude estimation, quantum maximum finding, and quantum matrix inversion. As a result, the proposed QRL algorithms guarantee improved sample or computational complexity. As these methods need oracular access to the environment, they should be categorized as post-NISQ algorithms.

| Citation | First Author | Title |
|---|---|---|
| [Wan+21] | D. Wang | Quantum algorithms for reinforcement learning with a generative model |
| [Gan+23] | B. Ganguly | Quantum Computing Provides Exponential Regret Improvement in Episodic Reinforcement Learning |
| [Zho+23] | H. Zhong | Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret |
| [GA23] | B. Ganguly | Quantum Acceleration of Infinite Horizon Average-Reward Reinforcement Learning |
| [CKP23] | E. A. Cherrat | Quantum Reinforcement Learning via Policy Iteration |
| [Wie+22] | S. Wiedemann | Quantum Policy Iteration via Amplitude Estimation and Grover Search - Towards Quantum Advantage for Reinforcement Learning |

Table 29: Work considered for “Quantum Policy and Value Iteration” (Sec. 4.5)

Quantum algorithms for reinforcement learning with a generative model, Wang et al. (2021)


Summary. The work in Ref. [Wan+21] proposes two algorithms for RL with a generative model and rigorously derives bounds for their sample complexity.

Classical Generative Models. Classically, the term generative model describes a simulator which, queried with a state-action pair $(s,a)$, produces a sample $s' \sim P(\cdot|s,a)$. Thus, by repeated sampling for each state-action pair, one can estimate the transition matrix of the underlying MDP. This allows one to subsequently obtain an approximation of the optimal policy by means of value iteration. Over the years, tremendous effort has been devoted to improving sample efficiency (defined as the number of times the simulator has to be queried). This performance metric is meaningful if one assumes that every query of the simulator is costly. The best classical algorithm [Li+20a] requires a total number of $\mathcal{O}(|S||A|\Gamma^{3}/\epsilon^{2})$ samples, where $|S|$ and $|A|$ are the number of states and actions, $\Gamma = 1/(1-\gamma)$ is the effective horizon of the MDP, and $\epsilon$ is the deviation of the optimal value function from the approximation. The sample complexity is linear in the product $|S||A|$, since the transition matrix has to be estimated for each $(s,a)$. The factor $1/\epsilon^{2}$ originates from Hoeffding’s inequality (indeed, bounding the deviation of a sample average from its true value by $\epsilon$ requires $\mathcal{O}(1/\epsilon^{2})$ samples). The origin of the third power of $\Gamma$, in contrast, is less intuitive. Note that the sample complexity of the classical algorithm is also a lower bound (in the classical case) and therefore optimal.
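The basic model-based recipe described above (estimate the transition matrix by repeated queries, then run value iteration) can be sketched as follows. This is a plain illustration on a random toy MDP and not the refined algorithm of Ref. [Li+20a]; the names `simulator`, `estimate_transitions`, and `value_iteration`, as well as the known reward table, are assumptions made for the example.

```python
import numpy as np

def estimate_transitions(simulator, n_states, n_actions, n_samples, rng):
    """Query the simulator n_samples times per (s, a) and count next states."""
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s, a, simulator(s, a, rng)] += 1
    return p_hat / n_samples

def value_iteration(p, r, gamma, n_iter=500):
    """Iterate the Bellman optimality operator on a tabular model."""
    v = np.zeros(p.shape[0])
    for _ in range(n_iter):
        v = np.max(r + gamma * p @ v, axis=1)
    return v

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 3, 2, 0.9
true_p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P(s'|s,a)
rewards = rng.uniform(size=(n_states, n_actions))  # assumed known, for simplicity

def simulator(s, a, rng):
    """Toy generative model: returns one sample s' ~ P(.|s, a)."""
    return rng.choice(n_states, p=true_p[s, a])

p_hat = estimate_transitions(simulator, n_states, n_actions, 2000, rng)
v_true = value_iteration(true_p, rewards, gamma)
v_hat = value_iteration(p_hat, rewards, gamma)
error = np.max(np.abs(v_true - v_hat))  # deviation in the maximum norm
```

The total number of simulator queries is `n_states * n_actions * n_samples`, which makes the linear dependence on $|S||A|$ explicit.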

Incorporating Quantum Subroutines. As shown in Ref. [Wan+21], the classical sample complexity can be reduced by replacing the classical mean-estimation subroutine in Ref. [Li+20a] by a quantum routine based on the quantum mean-estimation algorithm [Bra+02]. Even though the optimal classical algorithm is more sophisticated than outlined above, and so is its quantization, the following discussion captures the essential features. The quantum subroutine requires the generative model in oracle form and can then be used to estimate the expectation value $\mathbb{E}(V) = \sum_{s'} P(s'|s,a) V(s')$ (which appears in the Bellman equation) individually for every pair $(s,a)$ in time $\mathcal{O}(1/\epsilon)$. This quadratic speed-up originates from Grover’s algorithm, on which the quantum mean-estimation algorithm is based. As a consequence, the quantum-policy iteration algorithm achieves the sample complexity $\mathcal{O}(|S||A|\Gamma^{1.5}/\epsilon)$, a quadratic improvement in $\Gamma$ and $\epsilon$.

The dependence on the size of the action space can be further reduced by using quantum maximum finding [Mon15] to calculate the maximum over actions in the Bellman optimality equation. However, using this quantum routine, one cannot fully exploit the power of the classical optimal algorithm. Hence, while the dependence on $|A|$ is reduced quadratically and the improvement in $\epsilon$ is kept, the improvement in $\Gamma$ is lost. As a result, the algorithm based on both quantum mean estimation and quantum maximum finding achieves a sample complexity $\mathcal{O}(|S|\sqrt{|A|}\Gamma^{3}/\epsilon)$.

Finally, the lower bound $\mathcal{O}(|S||A|\Gamma^{1.5}/\epsilon)$ is derived and possible improvements of the algorithm to reach this limit are discussed.
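To get a feeling for the three scalings quoted in this summary, they can be evaluated for concrete numbers. The helper below is only an arithmetic illustration; constants and logarithmic factors are dropped, so the outputs indicate relative scaling rather than actual query counts, and the function name and dictionary keys are chosen for this example.

```python
import math

def sample_complexities(n_states, n_actions, gamma, eps):
    """Leading-order scalings quoted above; constants and log factors are dropped."""
    horizon = 1.0 / (1.0 - gamma)  # effective horizon Gamma
    return {
        "classical": n_states * n_actions * horizon**3 / eps**2,
        "quantum_mean_estimation": n_states * n_actions * horizon**1.5 / eps,
        "quantum_mean_and_max": n_states * math.sqrt(n_actions) * horizon**3 / eps,
    }

bounds = sample_complexities(n_states=100, n_actions=16, gamma=0.9, eps=0.01)
```

For these values, the variant with maximum finding saves a factor $\sqrt{|A|} = 4$ in the action dependence but pays a factor $\Gamma^{1.5}$ relative to the variant based on mean estimation alone, which reflects the trade-off described above.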

Quantum computing provides exponential regret improvement in episodic reinforcement learning, Ganguly et al. (2023)


Summary. In Ref. [Gan+23] and independently in Ref. [Zho+23], the authors consider the problem of an agent operating in a finite-horizon episodic tabular MDP and investigate whether quantum computing can alleviate the exploration-exploitation trade-off. This problem has been considered for the case of bandits [Wan+23, LHT22, LZ22] but is here generalized to the full multi-state RL problem. In the online setting, the agent only has access to the next state and reward given its current state and chosen action. In contrast, previous work [Wan+21] assumed access to a generative model, which can be queried with arbitrary state-action pairs, producing samples of the next state and reward. The generative-model setting does not involve the exploration-exploitation trade-off that arises from online interaction with the environment. In the online setting, the agent must learn to discover high-reward states by a suitable exploration strategy. The performance of the agent in this problem can be measured by the regret, which is defined as the cumulative difference between the optimal value function and its approximation after $K$ episodes. The goal is to design an algorithm with the weakest scaling of the regret in $K$, indicating a more effective trade-off between exploration and exploitation. The classical UCB-VI algorithm achieves the lower bound $\Omega(\sqrt{K})$ [JOA10, AOM17] of the regret. The proposed quantum algorithm in Ref. [Gan+23] builds upon this classical algorithm by replacing the mean-estimation routine with a quantum algorithm. Given a state-action pair, the quantum algorithm assumes a ‘transition oracle’ which generates a quantum superposition over all possible next states with amplitudes given by the square root of the respective transition probabilities. A similar oracle is used for generating rewards. The algorithm utilizes the quantum multivariate mean-estimation algorithm [Ham21], which quadratically reduces the number of samples required to satisfy a given error bound for mean estimation. The result is a decrease of the regret of the quantum algorithm from $\mathcal{O}(\sqrt{K})$ to $\mathcal{O}(1)$ up to logarithmic factors. This is an exponential improvement over classical results. In a follow-up work by the same authors [GA23], the results were extended to infinite-horizon problems, where an exponential reduction in regret from $\mathcal{O}(\sqrt{T})$ to $\mathcal{O}(1)$ ($T$ being the total number of time steps) is achieved. Additionally, Ref. [Zho+23] considers linear function approximation and demonstrates that the exponential improvement is maintained.
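One heuristic way to see why a quadratic speed-up in mean estimation can turn $\sqrt{K}$ regret into logarithmic regret is to sum per-episode estimation errors. The sketch below assumes, purely for illustration, that the error in episode $k$ scales as $1/\sqrt{k}$ classically and as $1/k$ with quantum mean estimation; this is an intuition only and not the analysis carried out in Refs. [Gan+23, Zho+23].

```python
import numpy as np

def cumulative_error(n_episodes, exponent):
    """Sum per-episode estimation errors k**(-exponent) for k = 1..n_episodes."""
    k = np.arange(1, n_episodes + 1)
    return float(np.sum(k ** (-exponent)))

K = 10_000
classical = cumulative_error(K, 0.5)  # error ~ 1/sqrt(k): grows like 2*sqrt(K)
quantum = cumulative_error(K, 1.0)    # error ~ 1/k: grows like ln(K)
```

The first sum grows like $2\sqrt{K}$, the second like $\ln K$, mirroring the scalings quoted above.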

Quantum Reinforcement Learning via Policy Iteration, Cherrat et al. (2023)

Summary. Ref. [CKP23] proposes a quantum algorithm for an iterative scheme of $Q$-value evaluation and policy improvement. The algorithm evaluates the $Q$-values on a quantum computer, with the state vector representing the $Q$-values being extracted by measurements. The policy is afterwards improved on a classical device. The algorithm can achieve quantum advantage in certain situations.

To set up the general framework, the authors first formulate the Bellman equation for $Q$-value evaluation as a matrix equation [LP03]

$$Q = R + \gamma P \Pi Q\,.$$

Denoting the size of the action and state space as $|A|$ and $|S|$, the $|A||S|$-dimensional vectors $Q$ and $R$ represent the $Q$-values and the reward vector, respectively; the environment transition function is the $|A||S| \times |S|$-dimensional matrix $P$; the policy is represented by an $|S| \times |A||S|$-dimensional matrix $\Pi$; and $\gamma$ denotes the usual discounting factor. The authors propose to compute $(\mathbb{1} - \gamma P \Pi)^{-1} R$ on a quantum device.
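The matrix form of policy evaluation can be checked classically on a small example. The sketch below builds $P$ and $\Pi$ with the dimensions stated above for a random toy MDP and solves the linear system with a classical solver in place of the quantum matrix-inversion routine; the index convention $(s,a) \mapsto s|A| + a$ and all function names are assumptions of this example.

```python
import numpy as np

def policy_matrix(policy):
    """Build the |S| x |S||A| matrix Pi with Pi[s, (s, a)] = pi(a|s)."""
    n_states, n_actions = policy.shape
    pi = np.zeros((n_states, n_states * n_actions))
    for s in range(n_states):
        pi[s, s * n_actions:(s + 1) * n_actions] = policy[s]
    return pi

def evaluate_policy(p, r, policy, gamma):
    """Solve (1 - gamma * P * Pi) Q = R for the Q-vector."""
    dim = p.shape[0]
    return np.linalg.solve(np.eye(dim) - gamma * p @ policy_matrix(policy), r)

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9
p = rng.dirichlet(np.ones(n_states), size=n_states * n_actions)  # |S||A| x |S|
r = rng.uniform(size=n_states * n_actions)                        # reward vector R
policy = np.full((n_states, n_actions), 1.0 / n_actions)          # uniform policy
q = evaluate_policy(p, r, policy, gamma)
# Residual of the Bellman equation Q = R + gamma * P * Pi * Q.
residual = np.max(np.abs(q - (r + gamma * p @ policy_matrix(policy) @ q)))
```

Since $P\Pi$ is a stochastic matrix and $\gamma < 1$, the matrix $\mathbb{1} - \gamma P \Pi$ is always invertible, so the solve step is well defined.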

Quantum Subroutine: Block Encodings and Linear Algebra. To perform this task, Ref. [CKP23] relies on so-called block encodings of matrices [Gil+19]. This powerful framework gives rise to various quantum algorithms for encoding general complex (not necessarily rectangular) matrices in the leading principal block of a larger unitary matrix. Once the data has been loaded, the framework further provides linear-algebra routines such as matrix multiplication, addition [Gil+19] and inversion [CKS17]. The encoding algorithms need quantum access to the data, i.e. via oracles. Therefore, the methods can be attributed to the post-NISQ algorithms category. A well-known data-loading scheme is the sparse-input model, viable for sparse matrices. The authors of Ref. [CKP23] apply a more general scheme, the so-called $\mu_p(A)$ block encoding [CGJ19] of a matrix $A$. Here, the quality of the encoding (i.e. the probability to obtain the correct output of the algorithm, e.g. after matrix-vector multiplication and a subsequent measurement) depends on the maximum of the column and row norms of the matrix. This norm is a function of $p$, which can be chosen freely to optimize the encoding quality. Based on this formalism, the authors show that policy evaluation requires time

$$\mathcal{O}\big(\mu_P \Gamma \, \mathrm{polylog}(|S||A|\Gamma/\epsilon)\big)\,. \tag{49}$$

In Eq. 49, $\Gamma = (1-\gamma)^{-1}$ and $\epsilon$ denotes the accuracy of the matrix-inversion subroutine. The term $\mu_P$ describes the quality of the encoding of the environment-transition matrix, which depends on the structure of the environment. In the worst case it scales as $\sqrt{|S||A|}$. Due to the sparsity of the transition function of many environments, a better scaling is often expected. As discussed below, for the frozen-lake environment one even finds $\mu_P = \mathcal{O}(1)$. The complexity in Eq. 49 assumes an efficient loading routine for the matrices. To achieve efficient loading also for the policy matrix, a QRAM data structure for the policy needs to be constructed. This needs to happen in time $\mathcal{O}(|S||A|)$ for each policy-evaluation step. Afterwards, the matrix can be loaded efficiently for each cycle of the measurement protocol.

Classical Subroutine: Policy Improvement. The policy improvement step on a classical device requires reading out the $Q$-vector from the quantum computer after matrix inversion. Naively, one would expect that the measurement process introduces exponential overhead. However, since convergence results for the Bellman equations are based on the maximum norm ($L_{\infty}$ norm), the authors employ $L_{\infty}$-norm state tomography [KP20]. This is efficient, i.e. requires $\mathcal{O}(1/\epsilon^{2})$ shots, where $\epsilon$ now is the target accuracy for the optimal $Q$-values (under the $L_{\infty}$ norm). Consequently, the overall time complexity (neglecting logarithmic terms) of the algorithm is

$$\mathcal{O}\big(|S||A| + \mu_P \Gamma/\epsilon^{2}\big)\,. \tag{50}$$

In Eq. 50, the factor $1/\epsilon^{2}$ appears in the second term since the matrix-inversion subroutine is called for each of the $1/\epsilon^{2}$ shots. The first term is the classical complexity of calculating the $\mathrm{argmax}$ function for policy improvement and of constructing the policy oracle prior to each evaluation step.
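The $1/\epsilon^{2}$ shot count can be illustrated with a simplified classical simulation of measurement-based readout. The sketch below only estimates the magnitudes of the amplitudes of a normalized vector from computational-basis counts; sign information, which a full tomography routine also has to recover, is ignored, and the logarithmic factor in `n_shots` is an assumption of this example rather than the constant from Ref. [KP20].

```python
import numpy as np

def estimate_magnitudes(amplitudes, n_shots, rng):
    """Estimate |amplitude_i| from computational-basis measurement counts."""
    probs = np.abs(amplitudes) ** 2
    counts = rng.multinomial(n_shots, probs)
    return np.sqrt(counts / n_shots)

rng = np.random.default_rng(3)
q = rng.uniform(size=64)
state = q / np.linalg.norm(q)   # normalized, non-negative vector standing in for Q
eps = 0.05
n_shots = int(np.ceil(np.log(len(state)) / eps**2))  # ~ (1/eps^2) up to a log factor
estimate = estimate_magnitudes(state, n_shots, rng)
max_error = np.max(np.abs(estimate - state))          # error in the maximum norm
```

The number of shots depends on the dimension only through the logarithm, which is the reason why the readout does not introduce an exponential overhead when accuracy is measured in the maximum norm.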

Example Environments. The authors consider the FrozenLake and the InvertedPendulum environments as examples. We will briefly discuss the insights from the former here: The simple form of the environment allows choosing $\mu_P = 1/2$, which thus is independent of the size of the action and state space. Note that the gate complexity is still of the order of $|S||A|$. It only becomes efficient for specially structured instances of the environment, such as all ‘holes’ on the diagonal of the grid.

Quantum advantage. The leading term in Eq. 50 is linear in $|S|$ and $|A|$, showing a speed-up with respect to classical solvers for linear systems of equations, which exhibit complexity $\mathcal{O}((|S||A|)^{\omega})$ with $\omega > 1$, and with respect to vanilla $Q$-value iteration, which has complexity $\mathcal{O}(|S|^{2}|A|)$. Even though a more detailed characterization of possible quantum advantage is not provided in Ref. [CKP23], it is clear that the speed-up can be at most polynomial.

Least-Squares Policy Iteration. Finally, the authors generalize the method to least-squares policy iteration [LP03], where the $Q$-vector is approximated by a set of basis functions. For details refer to Refs. [LP03, CKP23].

Quantum Policy Iteration via Amplitude Estimation and Grover Search - Towards Quantum Advantage for Reinforcement Learning, Wiedemann et al. (2022)


Summary. In the QRL scheme proposed in Refs. [Wie+22, Wie21], a policy is evaluated by constructing a superposition of all possible trajectories of an MDP with fixed horizon and with finite action and state space. Making use of amplitude estimation [Bra+02], the number of calls to a state-transition oracle for estimation of the value function (up to some fixed additive error) can be quadratically reduced. A second algorithm, based on Grover’s algorithm, finds the optimal policy in the policy space quadratically faster than direct policy search.

First Algorithm. The first algorithm assumes access to a policy oracle $\Pi$ and an environment oracle $E$, which act on an initial state $|s\rangle$ as

$$\Pi\big(|s\rangle|0\rangle_{\mathcal{A}}\big) = \sum_{a} \sqrt{\pi(a|s)}\,|s\rangle|a\rangle\,,$$

$$E\big(|s\rangle|a\rangle|0\rangle_{\mathcal{R}}|0\rangle_{\mathcal{S}}\big) = \sum_{r,s'} \sqrt{p(r,s'|s,a)}\,|s\rangle|a\rangle|r\rangle|s'\rangle\,.$$

Applying these operators sequentially on partially fresh registers as shown in Fig. 17 results in a superposition of all possible trajectories

$$|t\rangle = |s_{0}\rangle|a_{0}\rangle|r_{1}\rangle|s_{1}\rangle \ldots |r_{H}\rangle|s_{H}\rangle\,,$$

where $H$ is the horizon of the MDP, such that the quantum state reads

$$|\psi^{\pi}\rangle = \sum_{t} \sqrt{p_{t}}\,|t\rangle|G_{t}\rangle\,.$$

Here, $p_{t}$ is the probability of trajectory $t$. An additional unitary operator has been applied that calculates the return $G_{t}$ of trajectory $t$ and encodes the value into an additional register entangled with the corresponding trajectory. The superscript $\pi$ on $|\psi^{\pi}\rangle$ denotes that the state corresponds to the superposition of trajectories for a given policy $\pi$.
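The quantities $p_t$ and $G_t$ entering this state can be made concrete by enumerating all trajectories of a tiny MDP classically. In the sketch below, rewards are assumed to be a deterministic function of $(s, a, s')$ and the discount is set to one; both are simplifications for illustration, and the function name is chosen for this example.

```python
import itertools
import numpy as np

def trajectory_distribution(policy, p, rewards, s0, horizon, gamma=1.0):
    """Enumerate all (action, next-state) sequences with probability p_t and return G_t."""
    n_states, n_actions = policy.shape
    steps = list(itertools.product(range(n_actions), range(n_states)))
    result = []
    for traj in itertools.product(steps, repeat=horizon):
        s, prob, ret = s0, 1.0, 0.0
        for t, (a, s_next) in enumerate(traj):
            prob *= policy[s, a] * p[s, a, s_next]
            ret += gamma**t * rewards[s, a, s_next]
            s = s_next
        if prob > 0:
            result.append((traj, prob, ret))
    return result

rng = np.random.default_rng(4)
policy = np.array([[0.7, 0.3], [0.4, 0.6]])     # pi(a|s)
p = rng.dirichlet(np.ones(2), size=(2, 2))      # p[s, a, s']
rewards = rng.uniform(size=(2, 2, 2)) / 3       # keeps G_t in [0, 1] for horizon 3
trajs = trajectory_distribution(policy, p, rewards, s0=0, horizon=3)
total_prob = sum(prob for _, prob, _ in trajs)
value = sum(prob * ret for _, prob, ret in trajs)  # value of s0 under the policy
```

The amplitudes of $|\psi^{\pi}\rangle$ are the square roots of the `prob` entries, and the weighted sum `value` is the quantity that the algorithm aims to estimate.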

Figure 17: Sequence of policy and environment operator applications to an initial state $s$. This constructs a superposition of all possible trajectories for a fixed-horizon MDP, as shown in Wiedemann et al. [Wie21].

The next step of the algorithm attaches an ancilla qubit. With bit-by-bit rotations of the state, the digital encoding of $G_{t}$ is transformed into amplitude encoding (assuming here for simplicity $G_{t} \in [0,1]$). A simple calculation reveals that the probability of finding the ancilla qubit in state $|1\rangle$ is given by the average return, that is, the value function of the initial state $s$. With this insight in mind, the authors propose to use amplitude estimation [Bra+02], which involves the phase-estimation algorithm, to extract the value function. While classically sampling from the superposition of trajectories would require $\mathcal{O}(1/\epsilon^{2})$ preparations of the state, the quantum algorithm achieves the same error with $\mathcal{O}(1/\epsilon)$ such preparations, resulting in a quadratic speed-up. Here, $\epsilon$ denotes the fixed additive error to which the value function is to be determined.
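The ‘simple calculation’ mentioned above can be reproduced numerically. The sketch below models the net effect of the rotations as mapping $|t\rangle|0\rangle$ to $|t\rangle(\sqrt{1-G_t}\,|0\rangle + \sqrt{G_t}\,|1\rangle)$, which is an assumed idealization of the bit-by-bit construction, and uses made-up trajectory probabilities and returns.

```python
import numpy as np

def ancilla_one_probability(probs, returns):
    """Build sum_t sqrt(p_t)|t>(sqrt(1-G_t)|0> + sqrt(G_t)|1>) and return P(ancilla = 1)."""
    amplitudes = np.sqrt(probs)
    # Column 0: ancilla |0>, column 1: ancilla |1>.
    state = np.stack([amplitudes * np.sqrt(1 - returns),
                      amplitudes * np.sqrt(returns)], axis=1)
    assert np.isclose(np.sum(state**2), 1.0)  # the state is normalized
    return float(np.sum(state[:, 1] ** 2))

probs = np.array([0.5, 0.3, 0.2])     # trajectory probabilities p_t
returns = np.array([1.0, 0.5, 0.0])   # returns G_t, assumed to lie in [0, 1]
p_one = ancilla_one_probability(probs, returns)
expected_return = float(probs @ returns)
```

Both numbers agree, since the probability of measuring $|1\rangle$ is $\sum_t p_t G_t$. Amplitude estimation then estimates this probability without sampling the ancilla shot by shot.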

Second Algorithm. The second algorithm shown in Ref. [Wie+22] is a quantum version of direct policy search. The authors propose to create a superposition

$$\frac{1}{\sqrt{|P|}} \sum_{\pi} |\pi\rangle|\psi^{\pi}\rangle\,,$$

where $|\pi\rangle$ is a digital representation of the policy, $|\psi^{\pi}\rangle$ the superposition of all trajectories corresponding to policy $\pi$ as before, and $|P|$ the size of the policy space. Quantum minimum finding [DH96] can now be applied to find the optimal policy (the one with maximal expected return starting from initial state $s$), requiring only $\mathcal{O}(\sqrt{|P|})$ preparations of the state, compared with $\mathcal{O}(|P|)$ in classical direct policy search. Note, however, that the space of all policies scales as $\mathcal{O}(|A|^{|S|})$, where $|A|$ and $|S|$ are the sizes of action and state space, respectively. Consequently, the quantum algorithm scales exponentially worse compared to policy iteration, where the Bellman optimality equation is iterated with polynomial complexity in $|S|$ and $|A|$. The method proposed in Refs. [Wie+22, Wie21] therefore should be seen as a quantum version of direct policy search.
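The size of the policy space makes the limitation concrete. The following arithmetic sketch counts deterministic policies and compares the classical and Grover-type search costs; constants are dropped and the function name is chosen for this example.

```python
import math

def policy_search_costs(n_states, n_actions):
    """Compare deterministic-policy counts with Grover-type and classical search costs."""
    n_policies = n_actions ** n_states            # |P| = |A|^{|S|}
    return {
        "policies": n_policies,
        "classical_search": n_policies,           # O(|P|) evaluations
        "grover_search": math.isqrt(n_policies),  # O(sqrt(|P|)) state preparations
    }

costs = policy_search_costs(n_states=16, n_actions=4)
```

Even for 16 states and 4 actions, the quadratically reduced count is $2^{16} = 65536$, and it keeps growing exponentially with $|S|$.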

4.6 Quantum Reinforcement Learning with Oracularized Environments

In this final section we summarize work that proposes fully quantum-mechanical approaches to QRL. In the articles we survey below, the environment is a quantum system or oracle that can be queried by superpositions of states and actions. Interactions with a quantum-mechanical agent create superpositions of trajectories as input for subroutines like Grover search, quantum-maximum finding, and amplitude estimation. Provable quantum advantage renders some of these proposals interesting candidates for the post-NISQ era.

| Citation | First Author | Title |
|---|---|---|
| [DTB16] | V. Dunjko | Quantum-Enhanced Machine Learning |
| [DTB15] | V. Dunjko | Framework for learning agents in quantum environments |
| [DTB17] | V. Dunjko | Advances in quantum reinforcement learning |
| [HDW21] | A. Hamann | Quantum-accessible reinforcement learning beyond strictly epochal environments |
| [Wan+21a] | D. Wang | Quantum exploration algorithms for multi-armed bandits |
| [Wan+23] | Z. Wan | Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets |
| [Sag+21] | V. Saggio | Experimental quantum speed-up in reinforcement learning agents |
| [HW22] | A. Hamann | Performance analysis of a hybrid agent for quantum-accessible reinforcement learning |
| [Cor18] | A. Cornelissen | Quantum gradient estimation and its application to quantum reinforcement learning |

Table 30: Work considered for “QRL with Oracularized Environments” (Sec. 4.6)

Quantum-Enhanced Machine Learning, Dunjko et al. (2016) and related work


Summary. In Ref. [DTB16] and in a more detailed preprint [DTB15] a general framework of an agent-environment interaction where both entities are quantum-mechanical systems is developed. To query the environment by a superposition of action states (intuitively the agent learns in parallel), clearly the environment must be modeled by some form of an oracle. As it turns out, this oracularization is much more involved than one might naively think. The focus of the work is therefore:

  • Formalizing a quantum mechanical version of agent-environment interaction

  • Investigation of the classical limit

  • Properties of the general quantum mechanical set-up

  • Treatment of special oracularizable environments

  • Identification of quantum advantage for these environments

General Setup. The interaction between agent and environment is modeled as shown in Fig. 2a in Ref. [DTB15]. The register $R_{A}$ processes the computations of the agent, while the register $R_{E}$ represents the environment. The communication register $R_{C}$ stores one action and one state. The interaction is described by completely positive trace-preserving (CPTP) maps or, if we wish, unitary maps on a larger system. The first map $M_{1}^{E}$ outputs the initial state and stores it in the communication register. The map $M_{1}^{A}$ (modeling the agent) reads this state and, after some processing on $R_{A}$, outputs an action state, which is added to the communication register. This action is then processed by $M_{2}^{E}$, which outputs a new state. Subsequently, the previous state in the communication register is overwritten, and so on. The particular form of the states of $R_{C}$ (whether in a superposition of actions or not) will be discussed later.

While $R_{C}$ only contains a state-action pair, the agent’s register stores all previous states and actions (because the next action proposed by the learning algorithm depends on all actions and states encountered before; note here the distinction between algorithm and policy). The same is true for the (in general non-Markovian) environment.

Next, as shown in Fig. 18, a tester register $R_{T}$ is introduced, which is designed to ‘observe’ the elapsed history (all encountered states and actions during a learning sequence). This copying from $R_{C}$ to $R_{T}$ is modeled by controlled unitaries (so they do not modify $R_{C}$). Each of them acts on a fresh part of the register $R_{T}$.

The term copying the register here means that a second register is appended to a superposition of computational basis states and each basis state is then copied onto it. This produces in general a highly entangled state, which cannot be factorized into the initial state on the first register and a copy on the second (note that the no-cloning theorem only rules out a transformation producing this factorized copy for a general initial state). The most general form of the tester interaction treated in this work allows additional unitary transformations, such that the copying can be described in the form of controlled unitaries. A tester interaction that merely copies the states will be referred to as classical.
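The difference between copying basis states and cloning can be seen on a single qubit. In the sketch below, a CNOT gate plays the role of the controlled unitary and the Schmidt rank is used as an entanglement check; this one-qubit setting is an illustrative assumption and not the general tester construction of Ref. [DTB15].

```python
import numpy as np

# CNOT with the first qubit as control and the second as target.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def copy_basis_states(qubit):
    """'Copy' a single-qubit state onto a fresh |0> register with a CNOT."""
    return CNOT @ np.kron(qubit, np.array([1.0, 0.0]))

def schmidt_rank(two_qubit_state):
    """Rank 1 means a product state, rank 2 an entangled state."""
    return int(np.linalg.matrix_rank(two_qubit_state.reshape(2, 2)))

plus = np.array([1.0, 1.0]) / np.sqrt(2)
copied = copy_basis_states(plus)   # (|00> + |11>)/sqrt(2)
cloned = np.kron(plus, plus)       # what a true clone would look like
basis_copy = copy_basis_states(np.array([0.0, 1.0]))  # |1> is copied to |11>
```

Here `copied` has Schmidt rank 2 (entangled), whereas `cloned` and `basis_copy` have rank 1: basis states are copied faithfully, while a superposition becomes entangled with the second register instead of being duplicated.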

After training, the register $R_T$ contains the sequence of actions and states, the so-called history. Any metric measuring performance of learning can be phrased as a function of the history probabilities. Therefore, it can be formulated as the expectation value of an observable on $R_T$.

Figure 18: Adding a tester as proposed in Dunjko et al. [DTB16].

Classical Limit. To recover the classical learning set-up, the notion of classical interaction is defined by restricting the form of the maps such that the state in $R_A - R_C - R_E$ remains separable (note that the absence of entanglement between the registers does not prohibit entangled agent or environment states; thus, quantum mechanical environments and agents equipped with a quantum computer are not excluded). Additionally, the tester interaction is supposed to be classical (in the sense defined above). For this setup it is shown that for every scenario with a separable register state there exists a classical environment and a classical agent that produce the same history. Consequently, there can be no quantum improvement in the figure of merit, even when the agent has access to a quantum computer.

General Quantum-Mechanical Set-Up. What happens when we allow general maps and general states on the registers? The authors prove that the state on $R_T$ is still an incoherent mixture, and therefore no quantum advantage can be expected. The reason for this result lies in the memory of the agent and possibly the environment: The agent in general has to remember all previously encountered states and actions, because the learning algorithm run by the agent is a function of that particular elapsed history. The quantum state therefore is a superposition of histories, each entangled with a state that describes the agent that has seen this particular history. The agent states belonging to different histories are orthogonal, since different histories translate into different bit strings of the memory. Thus, when tracing out these degrees of freedom, the resulting reduced density matrix on $R_T$ is an incoherent mixture and no quantum advantage can be achieved in the figure of merit. (Side remark: This does not exclude a quantum advantage in terms of computational complexity in the internal processing of the agent. The result is about exploiting the ‘quantumness’ of the environment-agent interaction.)
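The decoherence argument can be reproduced in a small numerical example. The sketch below is an illustration under simplifying assumptions, not the construction of the original work: it uses three histories with made-up probabilities, attaches either orthogonal memory states (an agent that remembers) or one common memory state (a memoryless agent), and traces out the memory. The helper `reduced_tester_state` and the binary reward observable are hypothetical.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # history probabilities (illustrative)
amps = np.sqrt(p)               # amplitudes of the history superposition
n = len(p)

def reduced_tester_state(memory_states):
    """Reduced density matrix on the tester register.

    memory_states[h] is the memory vector attached to history h.
    """
    joint = sum(amps[h] * np.kron(np.eye(n)[h], memory_states[h]) for h in range(n))
    m = joint.reshape(n, memory_states.shape[1])  # rows: tester index, columns: memory index
    return m @ m.conj().T                         # partial trace over the memory

rho_with_memory = reduced_tester_state(np.eye(n))       # orthogonal memory states
rho_memoryless = reduced_tester_state(np.ones((n, 1)))  # identical memory state for all histories

# A figure of merit that depends only on history probabilities is a diagonal observable.
rewards = np.array([1.0, 0.0, 1.0])
observable = np.diag(rewards)
merit_with_memory = np.trace(rho_with_memory @ observable).real
merit_memoryless = np.trace(rho_memoryless @ observable).real

purity_with_memory = np.trace(rho_with_memory @ rho_with_memory).real
purity_memoryless = np.trace(rho_memoryless @ rho_memoryless).real

print("with memory:\n", rho_with_memory)
print("memoryless:\n", rho_memoryless)
print("purities:", purity_with_memory, purity_memoryless)
```

With orthogonal memory states the reduced state is diagonal, i.e., an incoherent mixture of histories. Only in the memoryless case do the off-diagonal coherences survive, which is the situation exploited by the oracularized environments.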

We note that one has to be careful with the interpretation of density matrices. One might be inclined to think that an incoherent mixture of history states weighted by their probability in some sense corresponds to traversing all of the histories simultaneously. Note, however, that the correct expectation value with respect to this density matrix is only obtained in the limit of infinitely many runs, corresponding to sampling trajectories one after another.

Oracularization of Environments. The next part of the work focuses on a special class of environments and learning settings without memory, which overcome the decoherence problem. These oracularized environments are of the following form:

  • episodic with fixed horizon $\rightarrow$ fixed sequence of interactions

  • deterministic $\rightarrow$ action sequence fully determines the history, states can be disregarded

  • binary rewards issued at final state $\rightarrow$ allows use of Grover search

Quantum Advantage. With these assumptions, a proper oracle can be constructed that can be queried with a superposition of actions. This allows it to be used as a phase-flip oracle, as in the Deutsch-Jozsa or Grover algorithm. The time required for finding a rewarded action sequence is therefore quadratically reduced. Consequently, this setting is meaningful for learning tasks where the reward is very sparse, that is, where the agent cannot learn until it has first seen a reward. After this initial exploration phase, the agent can be trained further in simulation. Finally, some of the assumptions are relaxed: the authors also show how stochastic oracles can be constructed.
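The quadratic reduction can be made concrete with a state-vector simulation of Grover search over action sequences. This is a minimal sketch and not the algorithm of the original work: the function `rewarded`, the horizon of six binary actions, and the single rewarded sequence are assumptions made for illustration, and the oracle is simulated classically as a sign flip on the marked amplitudes.

```python
import itertools
import math
import numpy as np

def rewarded(seq):
    """Hypothetical deterministic environment: binary reward issued at the final step."""
    return seq == (1, 0, 1, 1, 0, 1)

horizon, actions = 6, (0, 1)
sequences = list(itertools.product(actions, repeat=horizon))  # all action sequences
n = len(sequences)
marked = np.array([rewarded(s) for s in sequences])

state = np.full(n, 1 / math.sqrt(n))  # uniform superposition over action sequences
iterations = math.floor(math.pi / 4 * math.sqrt(n / marked.sum()))
for _ in range(iterations):
    state = np.where(marked, -state, state)  # phase-flip oracle on rewarded sequences
    state = 2 * state.mean() - state         # inversion about the mean (diffusion step)

success_probability = float(np.sum(state[marked] ** 2))
print("oracle calls:", iterations, "out of", n, "sequences")
print("probability of measuring the rewarded sequence:", success_probability)
```

The number of oracle calls grows with the square root of the number of action sequences, whereas a classical exhaustive search over a single rewarded sequence needs about half of all sequences on average.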

Further Work. There is further work that builds upon the results of Refs. [DTB15, DTB16]. In Ref. [DTB17], the algorithm is applied to the optimization of parameters describing the properties of the agent (hyperparameters). It also discusses the notion of register hijacking, where the agent has access to hidden memory registers of the environment. This assumption allows the oracularization of more general environments, which is also discussed in Ref. [Dun+18]. This class is further generalized in Ref. [HDW21] beyond episodic environments. A closer investigation of amplitude amplification techniques for the special case of multi-armed bandit environments is conducted in Refs. [Wan+21a, Wan+23]. In Ref. [Sag+21], the learning setting is implemented experimentally for a two-qubit system and an experimental quantum advantage is observed. Finally, the performance of an agent in this setting is investigated in Ref. [HW22].

Quantum gradient estimation and its application to quantum reinforcement learning, Cornelissen (2018)


Summary. The master’s thesis [Cor18] considers model-based RL and develops quantum algorithms for policy evaluation and policy optimization. For the former method a quadratic improvement in sample complexity is found.

Quantum Policy Evaluation. A quantum algorithm for policy evaluation is presented in Sec. 6.2 of the thesis and is summarized in the following. The algorithm is executed on a register that is capable of storing $T$ states and actions of a $T$-step MDP. To generate a sequence, a transition-probability oracle and a policy oracle are defined. They generate a superposition of all possible action-state sequences of the Markov problem, weighted by the square root of the corresponding probabilities. Note that the state is normalized, as the probabilities sum up to one. Next, a reward oracle is defined which, when acting on a state-action pair, multiplies the state with a phase factor. The phase is the discounted reward for this state-action pair. The discount factors are introduced by making use of fractional phase oracles. This is discussed in detail in Secs. 4 and 5 of the thesis, which are based on Refs. [GAW19, Gil+19]. The fractional reward oracle is applied to every state-action pair in the register, resulting in a product of phase factors containing the individual discounted rewards. Thus, when merging the exponentials into one exponential, the full quantum state is a superposition of all state-action sequences, weighted by the square root of the individual probability and a phase factor containing the corresponding return. Next, it is shown how the phase factor can be encoded in the amplitude by a controlled operation on an ancilla qubit. Consequently, the probability of measuring the ancilla in, say, state $\ket{0}$ is given by the expectation value of the return, that is, the value function. It can be measured using quantum amplitude estimation, which is based on the phase estimation algorithm. The amplitude-estimation algorithm is a Grover-type algorithm. Hence, it is not surprising that the quadratic speed-up compared to classical Monte-Carlo sampling results from this algorithmic step.
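The relation between the ancilla probability and the value function can be checked classically on a tiny MDP. The sketch below makes several assumptions that are not taken from the thesis: the two-state, two-action MDP and all of its numbers are made up, rewards lie in $[0,1]$, the return is normalized by its maximum so that it can serve as a probability, and the phase-to-amplitude conversion is replaced by a direct ancilla rotation with amplitude $\sqrt{G}$ on $\ket{0}$. The amplitude estimation step itself is not simulated.

```python
import itertools
import numpy as np

gamma, T, s0 = 0.9, 3, 0
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition probabilities (illustrative)
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4], [0.2, 0.8]])   # pi[s, a] policy
R = np.array([[0.0, 1.0], [0.5, 0.2]])    # R[s, a] rewards in [0, 1]
g_max = sum(gamma ** t for t in range(T)) # largest possible return, used for normalization

amplitudes, returns = [], []
for traj in itertools.product(range(2), repeat=2 * T):  # (a_0, s_1, a_1, s_2, ...)
    s, prob, ret = s0, 1.0, 0.0
    for t in range(T):
        a, s_next = traj[2 * t], traj[2 * t + 1]
        prob *= pi[s, a] * P[s, a, s_next]
        ret += gamma ** t * R[s, a]
        s = s_next
    amplitudes.append(np.sqrt(prob))  # sequence weighted by the square root of its probability
    returns.append(ret / g_max)       # normalized return in [0, 1]
amplitudes, returns = np.array(amplitudes), np.array(returns)

# Assumed ancilla rotation |tau>|0> -> |tau>(sqrt(G)|0> + sqrt(1 - G)|1>):
# the probability of measuring the ancilla in |0> is the expected normalized return.
p_zero = float(np.sum(amplitudes ** 2 * returns))
value_from_ancilla = p_zero * g_max

# Classical reference: finite-horizon policy evaluation by backward recursion.
v = np.zeros(2)
for _ in range(T):
    v = np.sum(pi * (R + gamma * P @ v), axis=1)

print("value from ancilla probability:", value_from_ancilla)
print("value from backward recursion: ", v[s0])
```

Classical Monte-Carlo sampling estimates `p_zero` with an error that shrinks as the inverse square root of the number of samples, whereas amplitude estimation achieves an error that shrinks inversely with the number of oracle calls, which is the source of the quadratic improvement.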

Quantum Policy Optimization. In Sec. 6.4 of the thesis a policy optimization algorithm is developed. This method can be seen as a quantum analogue of policy gradient. First of all, the policy needs to be parameterized. This is done by introducing the parameters $x_{sa}$ such that $\pi(a|s) = x_{sa}$ for all $a$ but one arbitrarily chosen $a^{*}$, and $\pi(a^{*}|s) = 1 - \sum_{a} x_{sa}$ otherwise. By that definition, the policy is properly normalized and all $x_{sa} \in [0,1]$. Consequently, the expected return is a high-dimensional polynomial in the parameters $x_{sa}$. For taking the derivative of this objective, Jordan’s quantum gradient algorithm [Jor05] in its advanced form [GAW19] is employed. This leads to a finite-difference approximation of the gradients, written in a phase factor, which can be read out after applying phase estimation. Following Ref. [Gil+19], a significant amount of work is devoted to transforming the probability oracle for the policy and the transition matrix described above into a phase oracle. Once the superposition of state-action sequences is prepared, an oracle call multiplies each state in the superposition by a phase factor containing the corresponding discounted reward. Subsequently, the gradient estimation algorithm is applied and the gradients can be read out. This step is followed by adapting the policy through gradient ascent. It is concluded in the thesis that this policy optimization algorithm does not necessarily lead to a quantum speed-up. However, as the author argues, it is conceivable that improvements of the algorithm might lead to a quantum speed-up.
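The direct policy parameterization and the gradient-ascent loop can be sketched classically. In the following example, a central finite difference stands in for Jordan's quantum gradient algorithm, which is not simulated. The two-state, two-action MDP, the step size, the number of iterations, and all function names are assumptions made for illustration. With two actions per state, a single parameter per state suffices and the second action plays the role of $a^{*}$.

```python
import numpy as np

gamma, T, s0 = 0.9, 3, 0
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition probabilities (illustrative)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[0.0, 1.0], [0.5, 0.2]])    # R[s, a] rewards

def policy_from_params(x):
    """x[s] = pi(a=0|s); the remaining action a* = 1 receives the leftover probability."""
    return np.stack([x, 1.0 - x], axis=1)

def expected_return(x):
    """Expected T-step discounted return from s0, a polynomial in the parameters x."""
    pi = policy_from_params(x)
    v = np.zeros(2)
    for _ in range(T):
        v = np.sum(pi * (R + gamma * P @ v), axis=1)
    return v[s0]

def finite_difference_gradient(f, x, eps=1e-5):
    """Central finite differences, a classical stand-in for the quantum gradient estimate."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

x = np.array([0.5, 0.5])
j_start = expected_return(x)
for _ in range(50):
    # Gradient ascent step, clipped so that the parameters remain valid probabilities.
    x = np.clip(x + 0.1 * finite_difference_gradient(expected_return, x), 0.0, 1.0)
j_end = expected_return(x)

print("expected return before and after:", j_start, j_end)
print("final parameters:", x)
```

The classical loop evaluates the objective twice per parameter and iteration; the quantum gradient algorithm instead aims to obtain all gradient components from a phase estimation on a superposition of evaluation points.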

5 Outlook

We have given a rather detailed account of the various instances of QRL that have appeared throughout the literature. We observed that the dichotomy found at the hardware level, i.e., currently available NISQ devices vs. fault-tolerant and error-corrected QPUs, also manifests at the algorithmic level.

With NISQ devices in mind, VQCs have been suggested as function approximators. These replace their classical counterparts in RL algorithms with function approximation in policy space, value space, or both. Here, one typically replaces a classical learning heuristic by a learning heuristic with a quantum component. Any sort of potential quantum advantage, however, is not immediately apparent. We can eventually obtain theoretical insight into the properties of VQCs viewed as ML models and function approximators. However, a direct comparison to their classical cousins, such as neural networks, is anything but easy and might strongly depend on the chosen metric. How can we meaningfully deploy an agent trained with VQC components? What are the requirements for quantum advantage in such a heuristic setting? What does non-simulability of quantum circuits imply, e.g., for generalization bounds of VQCs as ML models? Can we scale VQCs while maintaining their desirable properties? What is the intrinsic inductive bias of VQCs viewed as ML models? What are the implications for RL and its application domains? All these questions are currently being investigated in the research community, and we are looking forward to new results.

While quantum algorithms for fault-tolerant and error-corrected QPUs have been put forward, we are still far from being able to deploy these algorithms for meaningful problem sizes. Given the necessary advancements of hardware platforms, it will be exciting to see whether these types of quantum algorithms will become competitive with classical learning approaches in practice.

We hope that our survey on the QRL literature and the various types of QRL algorithms will help guide newcomers to the field and will serve as a valuable reference for researchers.

Acknowledgments

We acknowledge collaboration and exchange with M. Franz, L. Wolf, M. Schönberger and W. Mauerer as well as M. J. Hartmann on the topic of quantum reinforcement learning. We further acknowledge exchange and discussion with W. Hauptmann, D. Hein, S. Udluft, V. Tresp, Y. Ma, A. Auer, M. Weber, B. Bisgin, L. Bleiziffer, C. Mendl, S. Wiedemann, S. Wölk, J. M. Lorenz, M. Monnet, T.-A. Dragan, G. Kruse and G. Kontes. We would like to thank M. Leib for feedback on an early version of the manuscript. This work was supported by the German Federal Ministry of Education and Research (BMBF), funding program “quantum technologies – from basic research to market”, grant number 13N15645.

Acronyms

BCQ
batch-constrained deep $Q$-learning
BCQQ
batch-constrained quantum $Q$-learning
CNN
convolutional neural network
CPTP
completely positive trace preserving
CQ2L
conservative quantum $Q$-learning
CQL
conservative $Q$-learning
CTDE
centralized training with decentralized execution
DDQL
double deep $Q$-learning
DL
deep learning
DLP
discrete logarithm problem
DNN
deep neural network
DQAS
differential quantum architecture search
DQL
deep $Q$-learning
DQN
deep $Q$-network
DRL
deep reinforcement learning
DRU
data re-uploading
FIM
Fisher information matrix
MARL
multi-agent reinforcement learning
MDP
Markov decision process
ML
machine learning
MPS
matrix product state
MSE
mean square error
NISQ
noisy intermediate-scale quantum
NN
neural network
PDF
probability density function
POMDP
partially observable Markov decision process
PPO
proximal policy optimization
PS
projective simulation
QA3C
quantum asynchronous advantage actor critic
QC
quantum computing
QCNN
quantum convolutional neural network
QDDPG
quantum deep deterministic policy gradient
QiRL
quantum-inspired reinforcement learning
QLSTM
quantum long short-term memory
QMARL
quantum multi-agent reinforcement learning
QML
quantum machine learning
QNN
quantum neural network
QNPG
quantum natural policy gradient
QPG
quantum policy gradient
QPU
quantum processing unit
QRL
quantum reinforcement learning
QRNN
quantum recurrent neural network
RL
reinforcement learning
SAC
soft actor-critic
TD
temporal difference
TN
tensor network
TSP
traveling salesman problem
VQA
variational quantum algorithm
VQC
variational quantum circuit
VQ-DQN
variational quantum deep $Q$-networks
VQE
variational quantum eigensolver
VRP
vehicle routing problem

References

  • [Abb+21] Amira Abbas et al. “The power of quantum neural networks” In Nat. Comput. Sci. 1.6, 2021, pp. 403–409 DOI: 10.1038/s43588-021-00084-1
  • [ACN22] Eva Andrés, Manuel Pegalajar Cuéllar and Gabriel Navarro “On the use of quantum reinforcement learning in energy-efficiency scenarios” In Energies 15.16, 2022, pp. 6034 DOI: 10.3390/en15166034
  • [ACN23] Eva Andrés, MP Cuellar and G Navarro “Efficient Dimensionality Reduction Strategies for Quantum Reinforcement Learning” In IEEE Access 11, 2023, pp. 104534–104553 DOI: 10.1109/ACCESS.2023.3318173
  • [Acu+22] Alberto Acuto et al. “Variational quantum soft actor-critic for robotic arm control” In arXiv:2212.11681, 2022 DOI: 10.48550/arXiv.2212.11681
  • [AHF20] Ramin Ayanzadeh, Milton Halem and Tim Finin “Reinforcement Quantum Annealing: A Hybrid Quantum Learning Automata” In Sci. Rep. 10.1, 2020, pp. 1–11 DOI: 10.1038/s41598-020-64078-1
  • [Alb+18] Francisco Albarrán-Arriagada, Juan C Retamal, Enrique Solano and Lucas Lamata “Measurement-based adaptation protocol with quantum reinforcement learning” In Phys. Rev. A 98.4, 2018, pp. 042315 DOI: 10.1103/PhysRevA.98.042315
  • [Alb+20] Francisco Albarrán-Arriagada, Juan Carlos Retamal, Enrique Solano and Lucas Lamata “Reinforcement learning for semi-autonomous approximate quantum eigensolver” In Mach. learn.: sci. technol. 1.1, 2020, pp. 015002 DOI: 10.1088/2632-2153/ab43b4
  • [Alv+16] Unai Alvarez-Rodriguez, Mikel Sanz, Lucas Lamata and Enrique Solano “Artificial Life in Quantum Technologies” In Sci. Rep. 6.1, 2016, pp. 1–9 DOI: 10.1038/srep20956
  • [Alv+18] Unai Alvarez-Rodriguez, Mikel Sanz, Lucas Lamata and Enrique Solano “Quantum Artificial Life in an IBM Quantum Computer” In Sci. Rep. 8.1, 2018, pp. 1–9 DOI: 10.1038/s41598-018-33125-3
  • [Ama98] Shun-Ichi Amari “Natural Gradient Works Efficiently in Learning” In Neural Comput. 10.2, 1998, pp. 251–276 DOI: 10.1162/089976698300017746
  • [Amb+19] Andris Ambainis et al. “Quantum Speedups for Exponential-Time Dynamic Programming Algorithms” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 1783–1793 DOI: 10.1137/1.9781611975482.107
  • [Ans+23] James Adu Ansere et al. “Quantum Deep Reinforcement Learning for Dynamic Resource Allocation in Mobile Edge Computing-based IoT Systems” In IEEE Trans. Wirel. Commun., 2023 DOI: 10.1109/TWC.2023.3330868
  • [AOM17] Mohammad Gheshlaghi Azar, Ian Osband and Rémi Munos “Minimax Regret Bounds for Reinforcement Learning” In Proceedings of the 34th International Conference on Machine Learning 70, 2017, pp. 263–272 URL: https://dl.acm.org/doi/10.5555/3305381.3305409
  • [Aru+17] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage and Anil Anthony Bharath “Deep reinforcement learning: A brief survey” In IEEE Signal Process. Mag. 34.6, 2017, pp. 26–38 DOI: 10.1109/MSP.2017.2743240
  • [Aru+19] Frank Arute et al. “Quantum Supremacy using a Programmable Superconducting Processor” In Nature 574, 2019, pp. 505–510 DOI: 10.1038/s41586-019-1666-5
  • [BAQ23] BAQIS Quafu Group “Quafu-RL: The Cloud Quantum Computers based Quantum Reinforcement Learning” In arXiv:2305.17966, 2023 DOI: 10.48550/arXiv.2305.17966
  • [BBA14] Jennifer Barry, Daniel T. Barry and Scott Aaronson “Quantum partially observable Markov decision processes” In Phys. Rev. A 90.3, 2014, pp. 032311 DOI: 10.1103/PhysRevA.90.032311
  • [BD12] Hans J. Briegel and Gemma De las Cuevas “Projective simulation for artificial intelligence” In Sci. Rep. 2.1, 2012, pp. 1–16 DOI: 10.1038/srep00400
  • [Bel+20] Dmitrii Beloborodov et al. “Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization” In Mach. Learn.: Sci. Technol. 2.2, 2020, pp. 025009 DOI: 10.1088/2632-2153/abc328
  • [Bel57] Richard Bellman “A Markovian decision process” In J. math. mech. 6.5, 1957, pp. 679–684 URL: https://www.jstor.org/stable/24900506
  • [Ben+20] Marcello Benedetti, Erika Lloyd, Stefan Sack and Mattia Fiorentini “Parameterized quantum circuits as machine learning models” In Quantum Sci. Technol. 5, 2020, pp. 019601 DOI: 10.1088/2058-9565/ab4eb5
  • [Ben80] Paul Benioff “The computer as a physical system: A microscopic quantum mechanical Hamiltonian model of computers as represented by Turing machines” In J. Stat. Phys. 22, 1980, pp. 563–591 DOI: 10.1007/BF01011339
  • [Bha+19] Kishor Bharti, Tobias Haug, Vlatko Vedral and Leong-Chuan Kwek “How to Teach AI to Play Bell Non-Local Games: Reinforcement Learning” In arXiv:1912.10783, 2019 DOI: 10.48550/arXiv.1912.10783
  • [BKS23] Simon Buchholz, Jonas M Kübler and Bernhard Schölkopf “Multi armed bandits and quantum channel oracles” In arXiv:2301.08544, 2023 DOI: 10.48550/arXiv.2301.08544
  • [BLT23] Shrigyan Brahmachari, Josep Lumbreras and Marco Tomamichel “Quantum contextual bandits and recommender systems for quantum data” In arXiv:2301.13524, 2023 DOI: 10.48550/arXiv.2301.13524
  • [Boy+20] Walter L. Boyajian et al. “On the convergence of projective-simulation–based reinforcement learning in Markov decision processes” In Quantum Mach. Intell. 2.13, 2020, pp. 1–21 DOI: 10.1007/s42484-020-00023-9
  • [Bra+02] Gilles Brassard, Peter Hoyer, Michele Mosca and Alain Tapp “Quantum amplitude amplification and estimation” In Contemp. Math. 305, 2002, pp. 53–74 URL: http://www.ams.org/books/conm/305/
  • [BYK22] Niyazi Furkan Bar, Hasan Yetis and Mehmet Karakose “An Approach Based on Quantum Reinforcement Learning for Navigation Problems” In 2022 International Conference on Data Analytics for Business and Industry (ICDABI), 2022, pp. 593–597 DOI: 10.1109/ICDABI56818.2022.10041570
  • [Cár+18] Francisco A Cárdenas-López, Lucas Lamata, Juan Carlos Retamal and Enrique Solano “Multiqubit and multilevel quantum reinforcement learning with quantum technologies” In PloS one 13.7, 2018, pp. e0200455 DOI: 10.1371/journal.pone.0200455
  • [CCC23] Hao-Yuan Chen, Yen-Jui Chang and Ching-Ray Chang “Deep-Q Learning with Hybrid Quantum Neural Network on Solving Maze Problems” In arXiv:2304.10159, 2023 DOI: 10.48550/arXiv.2304.10159
  • [CCL19] Iris Cong, Soonwon Choi and Mikhail D. Lukin “Quantum convolutional neural networks” In Nat. Phys. 15.12, 2019, pp. 1273–1278 DOI: 10.1038/s41567-019-0648-8
  • [CD08] Chun-Lin Chen and Daoyi Dong “Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning” In Reinforcement Learning, 2008 DOI: 10.5772/5275
  • [CD10] Chunlin Chen and Daoyi Dong “Complexity analysis of quantum reinforcement learning” In Proceedings of the 29th Chinese Control Conference, 2010, pp. 5897–5901 URL: https://ieeexplore.ieee.org/abstract/document/5572589
  • [CDC06] Chun-Lin Chen, Daoyi Dong and Zonghai Chen “Quantum computation for action selection using reinforcement learning” In Int. J. Quantum Inf. 4.06, 2006, pp. 1071–1083 DOI: 10.1142/S0219749906002419
  • [Cer+21] Marco Cerezo et al. “Variational quantum algorithms” In Nat. Rev. Phys. 3.9, 2021, pp. 625–644 DOI: 10.1038/s42254-021-00348-9
  • [CFD12] Chen Chunlin, Jiang Frank and Dong Daoyi “Hybrid control of uncertain quantum systems via fuzzy estimation and quantum reinforcement learning” In Proceedings of the 31st Chinese Control Conference, 2012, pp. 7177–7182 URL: https://ieeexplore.ieee.org/abstract/document/6391208
  • [CGJ19] Shantanav Chakraborty, András Gilyén and Stacey Jeffery “The Power of Block-Encoded Matrix Powers: Improved Regression Techniques via Faster Hamiltonian Simulation” In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019) 132, 2019, pp. 33:1–33:14 DOI: 10.4230/LIPIcs.ICALP.2019.33
  • [Che+06] Chunlin Chen, Daoyi Dong, Yu Dong and Qiong Shi “A quantum reinforcement learning method for repeated game theory” In 2006 International Conference on Computational Intelligence and Security 1, 2006, pp. 68–72 DOI: 10.1109/ICCIAS.2006.294092
  • [Che+19] Chih-Chieh Chen, Shiue-Yuan Shiau, Ming-Feng Wu and Yuh-Renn Wu “Hybrid classical-quantum linear solver using Noisy Intermediate-Scale Quantum machines” In Sci. Rep. 9.1, 2019, pp. 1–12 DOI: 10.1038/s41598-019-52275-6
  • [Che+20] Samuel Yen-Chi Chen et al. “Variational Quantum Circuits for Deep Reinforcement Learning” In IEEE Access 8, 2020, pp. 141007–141024 DOI: 10.1109/access.2020.3010470
  • [Che+21] Samuel Yen-Chi Chen, Chih-Min Huang, Chia-Wei Hsing and Ying-Jer Kao “An end-to-end trainable hybrid classical-quantum classifier” In Mach. learn.: sci. technol. 2.4, 2021, pp. 045021 DOI: 10.1088/2632-2153/ac104d
  • [Che+22] Samuel Yen-Chi Chen et al. “Variational quantum reinforcement learning via evolutionary optimization” In Mach. learn.: sci. technol. 3.1, 2022, pp. 015025 DOI: 10.1088/2632-2153/ac4559
  • [Che+23] Zhihao Cheng, Kaining Zhang, Li Shen and Dacheng Tao “Offline quantum reinforcement learning in a conservative manner” In Proceedings of the AAAI Conference on Artificial Intelligence 37.6, 2023, pp. 7148–7156 DOI: 10.1609/aaai.v37i6.25872
  • [Che+23a] Zhihao Cheng, Kaining Zhang, Li Shen and Dacheng Tao “Quantum Imitation Learning” In IEEE Trans. Neural Netw. Learn. Syst., 2023, pp. 1–15 DOI: 10.1109/TNNLS.2023.3275075
  • [Che+23b] El Amine Cherrat et al. “Quantum Deep Hedging” In Quantum 7, 2023, pp. 1191 DOI: 10.22331/q-2023-11-29-1191
  • [Che10] Ran Cheng “Quantum Geometric Tensor (Fubini-Study Metric) in Simple Quantum System: A pedagogical Introduction” In arXiv:1012.1337, 2010 DOI: 10.48550/arXiv.1012.1337
  • [Che23] Samuel Yen-Chi Chen “Asynchronous training of quantum reinforcement learning” In arXiv:2301.05096, 2023 DOI: 10.48550/arXiv.2301.05096
  • [Che23a] Samuel Yen-Chi Chen “Efficient quantum recurrent reinforcement learning via quantum reservoir computing” In arXiv:2309.07339, 2023 DOI: 10.48550/arXiv.2309.07339
  • [Che23b] Samuel Yen-Chi Chen “Quantum Deep Q-Learning with Distributed Prioritized Experience Replay” In IEEE International Conference on Quantum Computing and Engineering (QCE) 2, 2023, pp. 31–35 DOI: 10.1109/QCE57702.2023.10180
  • [Che23c] Samuel Yen-Chi Chen “Quantum deep recurrent reinforcement learning” In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5 DOI: 10.1109/ICASSP49357.2023.10096981
  • [Che23d] Samuel Yen-Chi Chen “Quantum Reinforcement Learning for Quantum Architecture Search” In Proceedings of the 2023 International Workshop on Quantum Classical Cooperative, 2023, pp. 17–20 DOI: 10.1145/3588983.3596692
  • [Cho+23] Byung** Cho, Yu Xiao, Pan Hui and Daoyi Dong “Quantum bandit with amplitude amplification exploration in an adversarial environment” In IEEE Transactions on Knowledge and Data Engineering, 2023 DOI: 10.1109/TKDE.2023.3279207
  • [CKP23] El Amine Cherrat, Iordanis Kerenidis and Anupam Prakash “Quantum reinforcement learning via policy iteration” In Quantum Mach. Intell. 5.2, 2023, pp. 30 DOI: 10.1007/s42484-023-00116-1
  • [CKS17] Andrew M. Childs, Robin Kothari and Rolando D. Somma “Quantum Algorithm for Systems of Linear Equations with Exponentially Improved Dependence on Precision” In SIAM J. Comput. 46.6, 2017, pp. 1920–1950 DOI: 10.1137/16M1087072
  • [Cob23] Joyce G.H. Cobussen “Quantum Reinforcement Learning for Sensor-Assisted Robot Navigation Tasks”, 2023 URL: https://lup.lub.lu.se/student-papers/search/publication/9141398
  • [Cor+23] Randall Correll et al. “Quantum Neural Networks for a Supply Chain Logistics Application” In Adv. Quantum Technol. 6.7, 2023, pp. 2200183 DOI: 10.1002/qute.202200183
  • [Cor18] Arjan Cornelissen “Quantum gradient estimation and its application to quantum reinforcement learning”, 2018 URL: https://repository.tudelft.nl/islandora/object/uuid:26fe945f-f02e-4ef7-bdcb-0a2369eb867e
  • [Cra+18] Daniel Crawford et al. “Reinforcement Learning Using Quantum Boltzmann Machines” In Quantum Inf. Comput. 18.1–2, 2018, pp. 51–74 URL: https://www.rintonpress.com/journals/doi/QIC18.1-2-3.html
  • [CRC23] James Chao, Ramiro Rodriguez and Sean Crowe “Quantum Enhancements for AlphaZero” In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, 2023, pp. 2179–2186 DOI: 10.1145/3583133.3596302
  • [Cro19] Gavin E. Crooks “Gradients of parameterized quantum gates using the parameter-shift rule and gate decomposition” In arXiv:1905.13311, 2019 DOI: 10.48550/arXiv.1905.13311
  • [ÇY23] Ercan Çağlar and İhsan Yilmaz “Secure Communication Based On Key Generation With Quantum Reinforcement Learning” In Int. J. Inf. Secur. 12.2, 2023, pp. 22–41 DOI: 10.55859/ijiss.1264169
  • [CYF22] Samuel Yen-Chi Chen, Shinjae Yoo and Yao-Lung L Fang “Quantum long short-term memory” In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8622–8626 IEEE
  • [Dal+20] Mogens Dalgaard, Felix Motzoi, Jens Jakob Sørensen and Jacob Sherson “Global optimization of quantum dynamics with AlphaZero deep exploration” In NPJ Quantum Inf. 6.1, 2020, pp. 1–9 DOI: 10.1038/s41534-019-0241-0
  • [Dal+22] Nicola Dalla Pozza, Lorenzo Buffoni, Stefano Martina and Filippo Caruso “Quantum reinforcement learning: the maze problem” In Quantum Mach. Intell. 4.1, 2022, pp. 1–10 DOI: 10.1007/s42484-022-00068-y
  • [DFB15] Vedran Dunjko, Nicolai Friis and Hans J Briegel “Quantum-enhanced deliberation of learning agents using trapped ions” In New J. Phys. 17.2, 2015, pp. 023006 DOI: 10.1088/1367-2630/17/2/023006
  • [DH96] Christoph Durr and Peter Hoyer “A quantum algorithm for finding the minimum” In arXiv:quant-ph/9607014, 1996
  • [DJ92] D. Deutsch and R. Jozsa “Rapid Solution of Problems by Quantum Computation” In Proc. R. Soc. Lond. 439.1907, 1992 DOI: 10.1098/rspa.1992.0167
  • [Don+06] Daoyi Dong, Chun-Lin Chen, Zonghai Chen and Chen-Bin Zhang “Quantum mechanics helps in learning for more intelligent robots” In Chinese Phys. Lett. 23.7, 2006, pp. 1691 DOI: 10.1088/0256-307X/23/7/010
  • [Don+06a] Daoyi Dong, Chunlin Chen, Chenbin Zhang and Zonghai Chen “Quantum robot: structure, algorithms and applications” In Robotica 24.4, 2006, pp. 513–521 DOI: 10.1017/S0263574705002596
  • [Don+08] Daoyi Dong et al. “Incoherent control of quantum systems with wavefunction-controllable subspaces via quantum reinforcement learning” In IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38.4, 2008, pp. 957–962 DOI: 10.1109/TSMCB.2008.926603
  • [Don+08a] Daoyi Dong, Chun-Lin Chen, Han-Xiong Li and Tzyh-Jong Tarn “Quantum reinforcement learning” In IEEE Trans. Syst. Man Cybern., Part B (Cybernetics) 38.5, 2008, pp. 1207–1220 DOI: 10.1109/TSMCB.2008.925743
  • [Don+12] Daoyi Dong, Chun-Lin Chen, Jian Chu and Tzyh-Jong Tarn “Robust Quantum-Inspired Reinforcement Learning for Robot Navigation” In IEEE/ASME Trans Mechatron 17.1, 2012, pp. 86–97 DOI: 10.1109/TMECH.2010.2090896
  • [Dră+22] Theodora-Augustina Drăgan, Maureen Monnet, Christian B Mendl and Jeanette Miriam Lorenz “Quantum Reinforcement Learning for Solving a Stochastic Frozen Lake Environment and the Impact of Quantum Architecture Choices” In arXiv:2212.07932, 2022 DOI: 10.48550/arXiv.2212.07932
  • [DS22] Li Ding and Lee Spector “Evolutionary quantum architecture search for parametrized quantum circuits” In Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2022, pp. 2190–2195 DOI: 10.1145/3520304.3534012
  • [DS23] Li Ding and Lee Spector “Multi-Objective Evolutionary Architecture Search for Parameterized Quantum Circuits” In Entropy 25.1, 2023, pp. 93–105 DOI: 10.3390/e25010093
  • [DTB15] Vedran Dunjko, Jacob M Taylor and Hans J Briegel “Framework for learning agents in quantum environments” In arXiv:1507.08482, 2015 URL: https://arxiv.longhoe.net/abs/1507.08482
  • [DTB16] Vedran Dunjko, Jacob M. Taylor and Hans J. Briegel “Quantum-Enhanced Machine Learning” In Phys. Rev. Lett. 117.13, 2016, pp. 130501 DOI: 10.1103/PhysRevLett.117.130501
  • [DTB17] Vedran Dunjko, Jacob M Taylor and Hans J Briegel “Advances in quantum reinforcement learning” In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2017, pp. 282–287 DOI: 10.1109/SMC.2017.8122616
  • [Dun+18] Vedran Dunjko, Yi-Kai Liu, Xingyao Wu and Jacob M Taylor “Exponential improvements for quantum-accessible reinforcement learning” In arXiv:1710.11160, 2018 DOI: 10.48550/arXiv.1710.11160
  • [EGW05] Damien Ernst, Pierre Geurts and Louis Wehenkel “Tree-based batch mode reinforcement learning” In J. Mach. Learn. Res. 6, 2005, pp. 503–556 URL: http://jmlr.org/papers/v6/ernst05a.html
  • [Fak+13] Pegah Fakhari, Karthikeyan Rajagopal, SN Balakrishnan and JR Busemeyer “Quantum inspired reinforcement learning in changing environment” In New Math. Nat. Comput. 9.03, 2013, pp. 273–294 DOI: 10.1142/S1793005713400073
  • [Fey82] Richard P. Feynman “Simulating physics with computers” In Int. J. Theor. Phys. 21.6/7, 1982, pp. 467–488 DOI: 10.1007/BF02650179
  • [FH23] Jesús Fernández-Villaverde and Isaiah J Hull “Dynamic Programming on a Quantum Annealer: Solving the RBC Model”, 2023 DOI: 10.3386/w31326
  • [Fla+20] Fulvio Flamini et al. “Photonic architecture for reinforcement learning” In New J. Phys. 22.4, 2020, pp. 045002 DOI: 10.1109/PN50013.2020.9166962
  • [Fla+23] Fulvio Flamini et al. “Reinforcement learning and decision making via single-photon quantum walks” In arXiv:2301.13669, 2023 DOI: 10.48550/arXiv.2301.13669
  • [Fös+18] Thomas Fösel, Petru Tighineanu, Talitha Weiss and Florian Marquardt “Reinforcement Learning with Neural Networks for Quantum Feedback” In Phys. Rev. X 8.3, 2018, pp. 031084 DOI: 10.1103/PhysRevX.8.031084
  • [FP+23] Getahun Fikadu Tilaye and Amit Pandey “Investigating the effects of hyperparameters in quantum-enhanced deep reinforcement learning” In Quantum Eng. 2023, 2023 DOI: 10.1155/2023/2451990
  • [Fra+22] Maja Franz et al. “Uncovering instabilities in variational-quantum deep Q-networks” In J. Franklin Inst., 2022 DOI: 10.1016/j.jfranklin.2022.08.021
  • [Fuj+19] Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh and Joelle Pineau “Benchmarking batch deep reinforcement learning algorithms” In arXiv:1910.01708, 2019 DOI: 10.48550/arXiv.1910.01708
  • [GA23] Bhargav Ganguly and Vaneet Aggarwal “Quantum Acceleration of Infinite Horizon Average-Reward Reinforcement Learning” In arXiv:2310.11684, 2023 DOI: 10.48550/arXiv.2310.11684
  • [Gan+23] Bhargav Ganguly, Yulian Wu, Di Wang and Vaneet Aggarwal “Quantum Computing Provides Exponential Regret Improvement in Episodic Reinforcement Learning” In arXiv:2302.08617, 2023 DOI: 10.48550/arXiv.2302.08617
  • [GAW19] András Gilyén, Srinivasan Arunachalam and Nathan Wiebe “Optimizing quantum optimization algorithms via faster quantum gradient computation” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 1425–1444 DOI: 10.1137/1.9781611975482.87
  • [GB10] Xavier Glorot and Yoshua Bengio “Understanding the difficulty of training deep feedforward neural networks” In J. Mach. Learn. Res. 9, 2010, pp. 249–256 URL: https://proceedings.mlr.press/v9/glorot10a.html
  • [GH19] Michael Ganger and Wei Hu “Quantum Multiple Q-Learning” In International Journal of Intelligence Science 9.01, 2019, pp. 1–22 DOI: 10.4236/ijis.2019.91001
  • [Gil+19] András Gilyén, Yuan Su, Guang Hao Low and Nathan Wiebe “Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics” In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019, pp. 193–204 DOI: 10.1145/3313276.3316366
  • [Gra+19] Edward Grant, Leonard Wossnig, Mateusz Ostaszewski and Marcello Benedetti “An initialization strategy for addressing barren plateaus in parametrized quantum circuits” In Quantum 3, 2019, pp. 214 DOI: 10.22331/q-2019-12-09-214
  • [Haa+17] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel and Sergey Levine “Reinforcement learning with deep energy-based policies” In Proceedings of Machine Learning Research 70, 2017, pp. 1352–1361 URL: https://proceedings.mlr.press/v70/haarnoja17a.html
  • [Haa+18] Tuomas Haarnoja et al. “Soft actor-critic algorithms and applications” In arXiv:1812.05905, 2018 DOI: 10.48550/arXiv.1812.05905
  • [Ham21] Yassine Hamoudi “Quantum Sub-Gaussian Mean Estimator” In 29th Annual European Symposium on Algorithms (ESA 2021) 204, 2021, pp. 50:1–50:17 DOI: 10.4230/LIPIcs.ESA.2021.50
  • [Has10] Hado van Hasselt “Double Q-learning” In NeurIPS 23.2, 2010, pp. 2613–2621 URL: https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html
  • [HDW21] Arne Hamann, Vedran Dunjko and Sabine Wölk “Quantum-accessible reinforcement learning beyond strictly epochal environments” In Quantum Mach. Intell. 3.22, 2021, pp. 1–18 DOI: 10.1007/s42484-021-00049-7
  • [Hei+22] Dirk Heimann, Hans Hohenfeld, Felix Wiebe and Frank Kirchner “Quantum deep reinforcement learning for robot navigation tasks” In arXiv:2202.12180, 2022 DOI: 10.48550/arXiv.2202.12180
  • [HH19] Wei Hu and James Hu “Q Learning with Quantum Neural Networks” In Natural Science 11.01, 2019, pp. 31–39 DOI: 10.4236/ns.2019.111005
  • [HH19a] Wei Hu and James Hu “Distributional Reinforcement Learning with Quantum Neural Networks” In Intelligent Control and Automation 10.02, 2019, pp. 63–78 DOI: 10.4236/ica.2019.102004
  • [HH19b] Wei Hu and James Hu “Reinforcement Learning with Deep Quantum Neural Networks” In Journal of Quantum Information Science 9.01, 2019, pp. 1–14 DOI: 10.4236/jqis.2019.91001
  • [HH19c] Wei Hu and James Hu “Training a Quantum Neural Network to Solve the Contextual Multi-Armed Bandit Problem” In Natural Science 11, 2019, pp. 17–27 DOI: 10.4236/ns.2019.111003
  • [Hic+23] Manuel Lautaro Hickmann et al. “Potential analysis of a Quantum RL controller in the context of autonomous driving” In 31st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2023, 2023, pp. 263–268 DOI: 10.14428/esann/2023.ES2023-22
  • [HK21] Tobias Haug and MS Kim “Optimal training of variational quantum algorithms without barren plateaus” In arXiv:2104.14543, 2021 DOI: 10.48550/arXiv.2104.14543
  • [Hsi+22] Jen-Yueh Hsiao et al. “Unentangled quantum reinforcement learning agents in the OpenAI Gym” In arXiv:2203.14348, 2022 DOI: 10.48550/arXiv.2203.14348
  • [HSS06] Tomoki Hamagami, Takashi Shibuya and Shingo Shimada “Complex-valued reinforcement learning” In 2006 IEEE International Conference on Systems, Man and Cybernetics 5, 2006, pp. 4175–4179 DOI: 10.1109/ICSMC.2006.384789
  • [HSW89] Kurt Hornik, Maxwell Stinchcombe and Halbert White “Multilayer feedforward networks are universal approximators” In Neural Netw. 2.5, 1989, pp. 359–366 DOI: 10.1016/0893-6080(89)90020-8
  • [Hu+21] Yazhou Hu, Fengzhen Tang, Jun Chen and Wenxue Wang “Quantum-enhanced reinforcement learning for control: a preliminary study” In Control Theory Technol. 19, 2021, pp. 455–464 DOI: 10.1007/s11768-021-00063-x
  • [HW22] Arne Hamann and Sabine Wölk “Performance analysis of a hybrid agent for quantum-accessible reinforcement learning” In New J. Phys. 24.3, 2022, pp. 033044 DOI: 10.1088/1367-2630/ac5b56
  • [IBM23] IBM Quantum “Qiskit Runtime Service, Sampler primitive (Version 0.9.1)”, https://quantum-computing.ibm.com/, 2023
  • [Jaš+19] Jan Jašek et al. “Experimental hybrid quantum-classical reinforcement learning by boson sampling: how to train a quantum cloner” In Optics Express 27.22, 2019, pp. 32454–32464 DOI: 10.1364/OE.27.032454
  • [Jer+21] Sofiene Jerbi et al. “Parametrized Quantum Policies for Reinforcement Learning” In Adv. Neural Inf. Process. Syst. 34, 2021, pp. 28362–28375 DOI: 10.5281/zenodo.5833370
  • [Jer+21a] Sofiene Jerbi et al. “Quantum Enhancements for Deep Reinforcement Learning in Large Spaces” In PRX Quantum 2.1, 2021, pp. 010328 DOI: 10.1103/PRXQuantum.2.010328
  • [Jer+23] Sofiene Jerbi, Arjan Cornelissen, Māris Ozols and Vedran Dunjko “Quantum Policy Gradient Algorithms” In 18th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2023), 2023, pp. 13:1–13:24 DOI: 10.4230/LIPIcs.TQC.2023.13
  • [JOA10] Thomas Jaksch, Ronald Ortner and Peter Auer “Near-optimal Regret Bounds for Reinforcement Learning” In J. Mach. Learn. Res. 11.51, 2010, pp. 1563–1600 URL: http://jmlr.org/papers/v11/jaksch10a.html
  • [Jor05] Stephen P. Jordan “Fast Quantum Algorithm for Numerical Gradient Estimation” In Phys. Rev. Lett. 95.5, 2005, pp. 050501 DOI: 10.1103/PhysRevLett.95.050501
  • [KCP23] Gyu Seon Kim, JaeHyun Chung and Soohyun Park “Realizing Stabilized Landing for Computation-Limited Reusable Rockets: A Quantum Reinforcement Learning Approach” In arXiv:2310.06541, 2023 DOI: 10.48550/arXiv.2310.06541
  • [Kha+19] Sami Khairy et al. “Reinforcement-Learning-Based Variational Quantum Circuits Optimization for Combinatorial Problems” In arXiv:1911.04574, 2019 DOI: 10.48550/arXiv.1911.04574
  • [Kha+20] Sami Khairy et al. “Learning to Optimize Variational Quantum Circuits to Solve Combinatorial Problems” In Proceedings of the AAAI Conference on Artificial Intelligence 34.03, 2020, pp. 2367–2375 DOI: 10.1609/aaai.v34i03.5616
  • [Kim+21] Tomoaki Kimura et al. “Variational Quantum Circuit-Based Reinforcement Learning for POMDP and Experimental Implementation” In Math. Probl. Eng. 2021, 2021, pp. 3511029 DOI: 10.1155/2021/3511029
  • [KLM21] Iordanis Kerenidis, Jonas Landman and Natansh Mathur “Classical and quantum algorithms for orthogonal neural networks” In arXiv:2106.07198, 2021 DOI: 10.48550/arXiv.2106.07198
  • [Köl+23] Michael Kölle et al. “Multi-Agent Quantum Reinforcement Learning using Evolutionary Optimization” In arXiv:2311.05546, 2023 DOI: 10.48550/arXiv.2311.05546
  • [Kos+21] Ilya Kostrikov, Rob Fergus, Jonathan Tompson and Ofir Nachum “Offline reinforcement learning with Fisher divergence critic regularization” In International Conference on Machine Learning, 2021, pp. 5774–5783 PMLR URL: https://proceedings.mlr.press/v139/kostrikov21a.html
  • [KP20] Iordanis Kerenidis and Anupam Prakash “A quantum interior point method for LPs and SDPs” In ACM Transactions on Quantum Computing 1.1, 2020, pp. 1–32 DOI: 10.1145/3406306
  • [Kru+23] Georg Kruse, Theodora-Augustina Dragan, Robert Wille and Jeanette Miriam Lorenz “Variational Quantum Circuit Design for Quantum Reinforcement Learning on Continuous Environments” In arXiv:2312.13798, 2023 DOI: 10.48550/arXiv.2312.13798
  • [KSG21] Kunal Kashyap, Daksh Shah and Lokesh Gautam “From Classical to Quantum: A Review of Recent Progress in Reinforcement Learning” In 2021 2nd International Conference for Emerging Technology (INCET), 2021, pp. 1–5 DOI: 10.1109/INCET51464.2021.9456218
  • [KT03] Vijay Konda and John N. Tsitsiklis “On Actor-Critic Algorithms” In SIAM J. Control Optim. 42.4, 2003, pp. 1143–1166 DOI: 10.1137/S0363012901385691
  • [Kum+20] Aviral Kumar, Aurick Zhou, George Tucker and Sergey Levine “Conservative Q-learning for offline reinforcement learning” In Advances in Neural Information Processing Systems 33, 2020, pp. 1179–1191 URL: https://proceedings.neurips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html
  • [Kum+23] Manoj Kumar, Upasana Dohare, Sushil Kumar and Neeraj Kumar “Blockchain Based Optimized Energy Trading for E-Mobility Using Quantum Reinforcement Learning” In IEEE Trans. Veh. Technol. 72.4, 2023, pp. 5167–5180 DOI: 10.1109/TVT.2022.3225524
  • [Kun22] Leonhard Kunczik “Reinforcement Learning with Hybrid Quantum Approximation in the NISQ Context”, 2022 DOI: 10.1007/978-3-658-37616-1
  • [KVW18] Wouter Kool, Herke Van Hoof and Max Welling “Attention, learn to solve routing problems!” In arXiv:1803.08475, 2018 DOI: 10.48550/arXiv.1803.08475
  • [Kwa+21] Yunseok Kwak et al. “Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation” In 2021 International Conference on Information and Communication Technology Convergence (ICTC), 2021, pp. 416–420 DOI: 10.1109/ICTC52510.2021.9620885
  • [LAD21] Yuanjian Li, A Hamid Aghvami and Daoyi Dong “Intelligent Trajectory Planning in UAV-Mounted Wireless Networks: A Quantum-Inspired Reinforcement Learning Perspective” In IEEE Wireless Commun. Lett. 10.9, 2021, pp. 1994–1998 DOI: 10.1109/LWC.2021.3089876
  • [Lam17] Lucas Lamata “Basic protocols in quantum reinforcement learning with superconducting circuits” In Sci. Rep. 7.1, 2017, pp. 1–10 DOI: 10.1038/s41598-017-01711-6
  • [Lam21] Lucas Lamata “Quantum Reinforcement Learning with Quantum Photonics” In Photonics 8.2, 2021, pp. 33 DOI: 10.3390/photonics8020033
  • [Lam23] Lucas Lamata “Quantum Machine Learning Implementations: Proposals and Experiments” In Adv. Quantum Technol. 6.7, 2023, pp. 2300059 DOI: 10.1002/qute.202300059
  • [Lan21] Qingfeng Lan “Variational Quantum Soft Actor-Critic” In arXiv:2112.11921, 2021 DOI: 10.48550/arXiv.2112.11921
  • [LAT21] Yunchao Liu, Srinivasan Arunachalam and Kristan Temme “A rigorous and robust quantum speed-up in supervised machine learning” In Nat. Phys. 17, 2021, pp. 1013–1017 DOI: 10.1038/s41567-021-01287-z
  • [Lev+17] Anna Levit et al. “Free energy-based reinforcement learning using a quantum processor” In arXiv:1706.00074, 2017 DOI: 10.48550/arXiv.1706.00074
  • [Lev+20] Sergey Levine, Aviral Kumar, George Tucker and Justin Fu “Offline reinforcement learning: Tutorial, review, and perspectives on open problems” In arXiv:2005.01643, 2020 DOI: 10.48550/arXiv.2005.01643
  • [LHT22] Josep Lumbreras, Erkka Haapasalo and Marco Tomamichel “Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states” In Quantum 6, 2022, pp. 749 DOI: 10.22331/q-2022-06-29-749
  • [Li+20] Ji-An Li et al. “Quantum reinforcement learning during human decision-making” In Nat. Hum. Behav. 4.3, 2020, pp. 294–307 DOI: 10.1038/s41562-019-0804-2
  • [Li+20a] Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model” In Adv. Neural Inf. Process. Syst. 33, 2020, pp. 12861–12872 URL: https://proceedings.neurips.cc/paper/2020/hash/96ea64f3a1aa2fd00c72faacf0cb8ac9-Abstract.html
  • [Lin92] Long-Ji Lin “Self-improving reactive agents based on reinforcement learning, planning and teaching” In Mach. Learn. 8.3, 1992, pp. 293–321 DOI: 10.1007/BF00992699
  • [Liu+22] Wenjie Liu et al. “A quantum system control method based on enhanced reinforcement learning” In Soft Comput. 26.14, 2022, pp. 6567–6575 DOI: 10.1007/s00500-022-07179-5
  • [Liu+23] Dan Liu et al. “Multi-agent quantum-inspired deep reinforcement learning for real-time distributed generation control of 100% renewable energy systems” In Eng. Appl. Artif. Intell. 119, 2023, pp. 105787 DOI: 10.1016/j.engappai.2022.105787
  • [LJ09] Mantas Lukoševičius and Herbert Jaeger “Reservoir computing approaches to recurrent neural network training” In Comput. Sci. Rev. 3.3, 2009, pp. 127–149 DOI: 10.1016/j.cosrev.2009.03.005
  • [LJW22] Yi-Pei Liu, Qing-Shan Jia and Xu Wang “Quantum reinforcement learning method and application based on value function” In IFAC-PapersOnLine 55.11, 2022, pp. 132–137 DOI: 10.1016/j.ifacol.2022.08.061
  • [LM23] Victor Lopez-Pastor and Florian Marquardt “Self-learning machines based on Hamiltonian echo backpropagation” In Phys. Rev. X 13.3, 2023, pp. 031020 DOI: 10.1103/PhysRevX.13.031020
  • [Lok+22] S Lokes et al. “Implementation of Quantum Deep Reinforcement Learning Using Variational Quantum Circuits” In 2022 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT), 2022, pp. 1–4 DOI: 10.1109/TQCEBT54229.2022.10041479
  • [Low+17] Ryan Lowe et al. “Multi-agent actor-critic for mixed cooperative-competitive environments” In Adv. Neural Inf. Process. Syst. 30, 2017, pp. 6382–6393 URL: https://dl.acm.org/doi/10.5555/3295222.3295385
  • [LP03] Michail G Lagoudakis and Ronald Parr “Least-squares policy iteration” In J. Mach. Learn. Res. 4, 2003, pp. 1107–1149 URL: https://www.jmlr.org/papers/v4/lagoudakis03a.html
  • [LS20] Owen Lockwood and Mei Si “Reinforcement Learning with Quantum Variational Circuit” In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 16.1, 2020, pp. 245–251 URL: https://ojs.aaai.org/index.php/AIIDE/article/view/7437
  • [LS21] Owen Lockwood and Mei Si “Playing Atari with Hybrid Quantum-Classical Reinforcement Learning” In NeurIPS 2020 Workshop on Pre-registration in Machine Learning 148, 2021, pp. 285–301 PMLR URL: https://proceedings.mlr.press/v148/lockwood21a.html
  • [Luo+20] Xiu-Zhe Luo, Jin-Guo Liu, Pan Zhang and Lei Wang “Yao.jl: Extensible, efficient framework for quantum algorithm design” In Quantum 4, 2020, pp. 341 DOI: 10.22331/q-2020-10-11-341
  • [LXJ23] Yaofu Liu, Chang Xu and Siyuan Jin “Reinforcement Learning for Continuous Control: A Quantum Normalized Advantage Function Approach” In 2023 IEEE International Conference on Quantum Software (QSW), 2023, pp. 83–87 DOI: 10.1109/QSW59989.2023.00020
  • [LZ22] Tongyang Li and Ruizhe Zhang “Quantum Speedups of Optimizing Approximately Convex Functions with Applications to Logarithmic Regret Stochastic Convex Bandits” In Advances in Neural Information Processing Systems 35, 2022, pp. 3152–3164 URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/14f75513f0f1ca01de1e826b52e6b840-Abstract-Conference.html
  • [Mei+23] Kai Meinerz, Simon Trebst, Mark Rudner and Evert Nieuwenburg “The Quantum Cartpole: A benchmark environment for non-linear reinforcement learning” In arXiv:2311.00756, 2023 DOI: 10.48550/arXiv.2311.00756
  • [Mel+17] Alexey A. Melnikov, Adi Makmal, Vedran Dunjko and Hans J. Briegel “Projective simulation with generalization” In Sci. Rep. 7.1, 2017, pp. 1–14 DOI: 10.1038/s41598-017-14740-y
  • [Mey+23] Nico Meyer et al. “Quantum Natural Policy Gradients: Towards Sample-Efficient Reinforcement Learning” In IEEE International Conference on Quantum Computing and Engineering (QCE) 2, 2023, pp. 36–41 DOI: 10.1109/QCE57702.2023.10181
  • [Mey+23a] Nico Meyer et al. “Quantum Policy Gradient Algorithm with Optimized Action Decoding” In International Conference on Machine Learning (ICML) 202, 2023, pp. 24592–24613 PMLR URL: https://proceedings.mlr.press/v202/meyer23a.html
  • [Mey21] Nico Meyer “Variational Quantum Circuits for Policy Approximation”, 2021
  • [MK21] Maximilian Moll and Leonhard Kunczik “Comparing quantum hybrid reinforcement learning to classical methods” In Hum. Intell. Syst. Integr. 3.1, 2021, pp. 15–23 DOI: 10.1007/s42454-021-00025-3
  • [ML21] José D Martín-Guerrero and Lucas Lamata “Reinforcement Learning and Physics” In Appl. Sci. 11.18, 2021, pp. 8589 DOI: 10.3390/app11188589
  • [ML22] José D. Martín-Guerrero and Lucas Lamata “Quantum Machine Learning: A tutorial” In Neurocomputing 470, 2022, pp. 457–461 DOI: 10.1016/j.neucom.2021.02.102
  • [Mni+15] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning” In Nature 518.7540, 2015, pp. 529–533 DOI: 10.1038/nature14236
  • [Mni+16] Volodymyr Mnih et al. “Asynchronous methods for deep reinforcement learning” In International Conference on Machine Learning (ICML) 48 PMLR, 2016, pp. 1928–1937 URL: https://proceedings.mlr.press/v48/mniha16.html
  • [MNM17] Masaki Mochida, Hidehiro Nakano and Arata Miyauchi “A complex-valued reinforcement learning method using complex-valued neural networks” In IEICE Technical Report; IEICE Tech. Rep. 117.112, 2017, pp. 1–5 URL: https://ken.ieice.org/ken/paper/20170629ebuV/eng/
  • [Mon15] Ashley Montanaro “Quantum speedup of Monte Carlo methods” In Proc. Math. Phys. Eng. Sci. 471.2181, 2015, pp. 20150301 DOI: 10.1098/rspa.2015.0301
  • [Mül+21] Tobias Müller, Christoph Roch, Kyrill Schmid and Philipp Altmann “Towards Multi-Agent Reinforcement Learning using Quantum Boltzmann Machines” In arXiv:2109.10900, 2021 DOI: 10.48550/arXiv.2109.10900
  • [MVB22] Thomas Mullor, David Vigouroux and Louis Bethune “Efficient circuit implementation for coined quantum walks on binary trees and application to reinforcement learning” In IEEE/ACM Symposium on Edge Computing (SEC), 2022, pp. 436–443 DOI: 10.1109/SEC54971.2022.00066
  • [Nag+21] Dániel Nagy et al. “Photonic quantum policy learning in OpenAI Gym” In IEEE International Conference on Quantum Computing and Engineering (QCE), 2021, pp. 123–129 DOI: 10.1109/QCE52317.2021.00028
  • [Neu+17] Florian Neukart et al. “Traffic flow optimization using a quantum annealer” In Front. ICT 4, 2017, pp. 29 DOI: 10.3389/fict.2017.00029
  • [Neu+20] Niels MP Neumann, Paolo BUL Heer, Irina Chiscop and Frank Phillipson “Multi-agent Reinforcement Learning Using Simulated Quantum Annealing” In International Conference on Computational Science, 2020, pp. 562–575 DOI: 10.1007/978-3-030-50433-5_43
  • [NGC15] Sinan Nuuman, David Grace and Tim Clarke “A quantum inspired reinforcement learning technique for beyond next generation wireless networks” In 2015 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), 2015, pp. 271–275 DOI: 10.1109/WCNCW.2015.7122566
  • [NHP23] Niels MP Neumann, Paolo BUL Heer and Frank Phillipson “Quantum reinforcement learning: Comparing quantum annealing and gate-based quantum computing with classical deep reinforcement learning” In Quantum Inf. Process. 22.2, 2023, pp. 125 DOI: 10.1007/s11128-023-03867-9
  • [Nir+21] Dipesh Niraula et al. “Quantum deep reinforcement learning for clinical decision support in oncology: application to adaptive radiotherapy” In Sci. Rep. 11.1, 2021, pp. 1–13 DOI: 10.1038/s41598-021-02910-y
  • [NL16] M.A. Nielsen and I.L. Chuang “Quantum Computation and Quantum Information (10th Anniversary edition)” Cambridge University Press, 2016 DOI: 10.1017/CBO9780511976667
  • [NLH20] Rui Nian, Jinfeng Liu and Biao Huang “A review On reinforcement learning: Introduction and applications in industrial process control” In Comput. Chem. Eng. 139, 2020, pp. 106886 DOI: 10.1016/j.compchemeng.2020.106886
  • [NS+23] Bhaskara Narottama and Soo Young Shin “Layerwise Quantum Deep Reinforcement Learning for Joint Optimization of UAV Trajectory and Resource Allocation” In IEEE Internet Things J., 2023 DOI: 10.1109/JIOT.2023.3285968
  • [NW05] Sanjeev Naguleswaran and Langford B. White “Quantum search in stochastic planning” In Noise and Information in Nanoelectronics, Sensors, and Standards III 5846, 2005, pp. 34–45 DOI: 10.1117/12.609962
  • [NY23] Egor E Nuzhin and Dmitry Yudin “Quantum-enhanced policy iteration on the example of a mountain car” In arXiv:2308.08348, 2023 DOI: 10.48550/arXiv.2308.08348
  • [Oli+20] Julio Olivares-Sánchez, Jorge Casanova, Enrique Solano and Lucas Lamata “Measurement-Based Adaptation Protocol with Quantum Reinforcement Learning in a Rigetti Quantum Computer” In Quantum Reports 2.2, 2020, pp. 293–304 DOI: 10.3390/quantum2020019
  • [Pap+14] Giuseppe Davide Paparo et al. “Quantum Speedup for Active Learning Agents” In Phys. Rev. X 4.3, 2014, pp. 031002 DOI: 10.1103/PhysRevX.4.031002
  • [Par+23] Chanyoung Park et al. “Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV Systems” In IEEE Internet Things J. 10.22, 2023, pp. 20033–20048 DOI: 10.1109/JIOT.2023.3282908
  • [Par+23a] Soohyun Park et al. “Quantum Multi-Agent Reinforcement Learning for Autonomous Mobility Cooperation” In IEEE Commun. Mag., 2023 DOI: 10.1109/MCOM.020.2300199
  • [Per+06] David Perez-Garcia, Frank Verstraete, Michael M Wolf and J Ignacio Cirac “Matrix product state representations” In arXiv:quant-ph/0608197, 2006 DOI: 10.48550/arXiv.quant-ph/0608197
  • [Pér+20] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster and José I Latorre “Data re-uploading for a universal quantum classifier” In Quantum 4, 2020, pp. 226 DOI: 10.22331/q-2020-02-06-226
  • [Per+22] Maniraman Periyasamy et al. “Incremental Data-Uploading for Full-Quantum Classification” In IEEE International Conference on Quantum Computing and Engineering (QCE), 2022, pp. 31–37 DOI: 10.1109/QCE53715.2022.00021
  • [Per+23] Maniraman Periyasamy et al. “Batch Quantum Reinforcement Learning” In arXiv:2305.00905, 2023 DOI: 10.48550/arXiv.2305.00905
  • [Pes+21] Arthur Pesah et al. “Absence of barren plateaus in quantum convolutional neural networks” In Phys. Rev. X 11.4, 2021, pp. 041011 DOI: 10.1103/PhysRevX.11.041011
  • [PK23] Soohyun Park and Joongheon Kim “Quantum Reinforcement Learning for Large-Scale Multi-Agent Decision-Making in Autonomous Aerial Networks” In 2023 VTS Asia Pacific Wireless Communications Symposium (APWCS), 2023, pp. 1–4 DOI: 10.1109/APWCS60142.2023.10233966
  • [PMV02] Vladimir Privman, Dima Mozyrsky and Israel Vagner “Quantum computing with spin qubits in semiconductor structures” In Comput. Phys. Commun. 146, 2002, pp. 331–338 DOI: 10.1016/S0010-4655(02)00424-1
  • [PPR20] Daniel K. Park, Jonghun Park and June-Koo Kevin Rhee “Quantum-classical reinforcement learning for decoding noisy classical parity information” In Quantum Mach. Intell. 2.1, 2020, pp. 1–11 DOI: 10.1007/s42484-020-00019-5
  • [PRD96] Elena Pashenkova, Irina Rish and Rina Dechter “Value iteration and policy iteration algorithms for Markov decision problem” Citeseer, 1996 URL: https://www.researchgate.net/publication/2605845_Value_iteration_and_policy_iteration_algorithms_for_Markov_decision_problem
  • [Rai+23] Serge Rainjonneau et al. “Quantum algorithms applied to satellite mission planning for Earth observation” In IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16, 2023, pp. 7062–7075 DOI: 10.1109/JSTARS.2023.3287154
  • [Raj+21] K Rajagopal et al. “Quantum amplitude amplification for reinforcement learning” In Handbook of Reinforcement Learning and Control 325, 2021, pp. 819–833 DOI: 10.1007/978-3-030-60990-0_26
  • [Ram17] A Ramezanpour “Optimization by a quantum reinforcement algorithm” In Phys. Rev. A 96.5, 2017, pp. 052307 DOI: 10.1103/PhysRevA.96.052307
  • [Ree23] Volker Reers “Towards Performance Benchmarking for Quantum Reinforcement Learning” In INFORMATIK 2023 - Designing Futures: Zukünfte gestalten. Gesellschaft für Informatik eV, 2023, pp. 1135–1145 DOI: 10.18420/inf2023_126
  • [Ren+22] Yuzheng Ren et al. “NFT-based intelligence networking for connected and autonomous vehicles: A quantum reinforcement learning approach” In IEEE Network 36.6, 2022, pp. 116–124 DOI: 10.1109/MNET.107.2100469
  • [RKM22] Farhad Rezazadeh, Sarang Kahvazadeh and Mohammadreza Mosahebfard “Towards Quantum-Enabled 6G Slicing” In arXiv:2212.11755, 2022 DOI: 10.48550/arXiv.2212.11755
  • [RN94] Gavin A Rummery and Mahesan Niranjan “On-line Q-learning using connectionist systems” Citeseer, 1994 URL: https://www.researchgate.net/publication/2500611_On-Line_Q-Learning_Using_Connectionist_Systems
  • [Ron19] Pooya Ronagh “The Problem of Dynamic Programming on a Quantum Computer” In arXiv:1906.02229, 2019 DOI: 10.48550/arXiv.1906.02229
  • [Sag+21] Valeria Saggio et al. “Experimental quantum speed-up in reinforcement learning agents” In Nature 591.7849, 2021, pp. 229–233 DOI: 10.1038/s41586-021-03242-7
  • [Sag+21a] Valeria Saggio et al. “Quantum speed-ups in reinforcement learning” In Quantum Nanophotonic Materials, Devices, and Systems 2021 11806, 2021, pp. 40–49 DOI: 10.1117/12.2593720
  • [San+22] Fabio Sanches, Sean Weinberg, Takanori Ide and Kazumitsu Kamiya “Short quantum circuits in reinforcement learning policies for the vehicle routing problem” In Phys. Rev. A 105.6, 2022, pp. 062403 DOI: 10.1103/PhysRevA.105.062403
  • [San+23] Antonio Sannia et al. “A hybrid classical-quantum approach to speed-up Q-learning” In Sci. Rep. 13.1, 2023, pp. 3913 DOI: 10.1038/s41598-023-30990-5
  • [SB18] R.S. Sutton and A.G. Barto “Reinforcement Learning: An Introduction” The MIT Press, 2018 URL: http://incompleteideas.net/book/the-book-2nd.html
  • [Sch+22] Michael Schenk et al. “Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines” In arXiv:2209.11044, 2022 DOI: 10.48550/arXiv.2209.11044
  • [SH20] Erik Sorensen and Wei Hu “Practical Meta-Reinforcement Learning of Evolutionary Strategy with Quantum Neural Networks for Stock Trading” In Journal of Quantum Information Science 10.3, 2020, pp. 43–71 DOI: 10.4236/jqis.2020.103005
  • [SH23] Maida Shahid and Muhammad Awais Hassan “Introducing Quantum Variational Circuit for Efficient Management of Common Pool Resources” In IEEE Access 11, 2023, pp. 110862–110877 DOI: 10.1109/ACCESS.2023.3322144
  • [She+20] Kishore S Shenoy, Dev Y Sheth, Bikash K Behera and Prasanta K Panigrahi “Demonstration of a measurement-based adaptation protocol with quantum reinforcement learning on the IBM Q experience platform” In Quantum Inf. Process. 19, 2020, pp. 1–13 DOI: 10.1007/s11128-020-02657-x
  • [Shi+22] Hiroaki Shinkawa et al. “Bandit approach to conflict-free multi-agent Q-learning in view of photonic implementation” In arXiv:2212.09926, 2022 DOI: 10.48550/arXiv.2212.09926
  • [Sho97] Peter W. Shor “Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer” In SIAM J. Comput. 26.5, 1997, pp. 1484–1509 DOI: 10.1137/s0097539795293172
  • [Sil+18] David Silver et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play” In Science 362.6419, 2018, pp. 1140–1144 DOI: 10.1126/science.aar6404
  • [SJA19] Sukin Sim, Peter D Johnson and Alán Aspuru-Guzik “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms” In Adv. Quantum Technol. 2.12, 2019, pp. 1900070 DOI: 10.1002/qute.201900070
  • [SJD22] Andrea Skolik, Sofiene Jerbi and Vedran Dunjko “Quantum agents in the Gym: a variational quantum algorithm for deep Q-learning” In Quantum 6, 2022, pp. 720 DOI: 10.22331/q-2022-05-24-720
  • [Sko+23] Andrea Skolik et al. “Robustness of quantum reinforcement learning under hardware errors” In EPJ Quantum Technol. 10.1, 2023, pp. 1–43 DOI: 10.1140/epjqt/s40507-023-00166-1
  • [SMK23] Akash Sinha, Antonio Macaluso and Matthias Klusch “Nav-Q: Quantum Deep Reinforcement Learning for Collision-Free Navigation of Self-Driving Cars” In arXiv:2311.12875, 2023 DOI: 10.48550/arXiv.2311.12875
  • [SMT23] Yize Sun, Yunpu Ma and Volker Tresp “Differentiable Quantum Architecture Search for Quantum Reinforcement Learning” In IEEE International Conference on Quantum Computing and Engineering (QCE) 2, 2023, pp. 15–19 DOI: 10.1109/QCE57702.2023.10177
  • [SP18] Maria Schuld and Francesco Petruccione “Supervised Learning with Quantum Computers” Springer, 2018 URL: https://link.springer.com/book/10.1007/978-3-319-96424-9
  • [Sri+18] Theeraphot Sriarunothai et al. “Speeding-up the decision making of a learning agent using an ion trap quantum processor” In Quantum Sci. Technol. 4.1, 2018, pp. 015014 DOI: 10.1088/2058-9565/aaef5e
  • [SSB23] André Sequeira, Luis Paulo Santos and Luis Soares Barbosa “Policy gradients using variational quantum circuits” In Quantum Mach. Intell. 5.1, 2023, pp. 18 DOI: 10.1007/s42484-023-00101-8
  • [SSM21] M. Schuld, R. Sweke and J.J. Meyer “Effect of data encoding on the expressive power of variational quantum-machine-learning models” In Phys. Rev. A 103.3, 2021, pp. 032430 DOI: 10.1103/physreva.103.032430
  • [Sto+20] James Stokes, Josh Izaac, Nathan Killoran and Giuseppe Carleo “Quantum Natural Gradient” In Quantum 4, 2020, pp. 269 DOI: 10.22331/q-2020-05-25-269
  • [Sut+99] Richard S. Sutton, David McAllester, Satinder Singh and Yishay Mansour “Policy gradient methods for reinforcement learning with function approximation” In NeurIPS 12, 1999 URL: https://papers.nips.cc/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html
  • [SWM10] Mark Saffman, Thad G Walker and Klaus Mølmer “Quantum information with Rydberg atoms” In Rev. Mod. Phys. 82.3, 2010, pp. 2313 DOI: 10.1103/RevModPhys.82.2313
  • [Tei21] Miguel Alexandre Brandão Teixeira “Quantum Reinforcement Learning Applied to Games”, 2021 URL: https://repositorio-aberto.up.pt/bitstream/10216/135628/2/487581.pdf
  • [Tha+23] Supanut Thanasilp et al. “Subtleties in the trainability of quantum machine learning models” In Quantum Mach. Intell. 5.1, 2023, pp. 21 DOI: 10.1007/s42484-023-00103-6
  • [TRC21] Miguel Teixeira, Ana Paula Rocha and Antonio JM Castro “Quantum Reinforcement Learning Applied to Board Games” In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2021, pp. 343–350 DOI: 10.1145/3486622.3493944
  • [Tru+23] Nguyen Truong Thu Ngo, Tien-Fu Lu, James Quach and Peter Bruza “Investigating Quantum Reinforcement Learning structure to the CartPole control task” In Proceedings of the 9th International Conference of Asian Society for Precision Engineering and Nanotechnology (ASPEN2022), 2023, pp. 227–230 URL: https://eprints.qut.edu.au/239327/
  • [VGS16] Hado Van Hasselt, Arthur Guez and David Silver “Deep reinforcement learning with double Q-learning” In Proceedings of the AAAI Conference on Artificial Intelligence 30.1, 2016 DOI: 10.1609/aaai.v30i1.10295
  • [Wan+21] Daochen Wang et al. “Quantum algorithms for reinforcement learning with a generative model” In International Conference on Machine Learning (ICML) 139 PMLR, 2021, pp. 10916–10926 URL: https://proceedings.mlr.press/v139/wang21w.html
  • [Wan+21a] Daochen Wang, Xuchen You, Tongyang Li and Andrew M Childs “Quantum exploration algorithms for multi-armed bandits” In Proceedings of the AAAI Conference on Artificial Intelligence 35.11, 2021, pp. 10102–10110 DOI: 10.1609/aaai.v35i11.17212
  • [Wan+23] Zongqi Wan et al. “Quantum multi-armed bandits and stochastic linear bandits enjoy logarithmic regrets” In Proceedings of the AAAI Conference on Artificial Intelligence 37.8, 2023, pp. 10087–10094 DOI: 10.1609/aaai.v37i8.26202
  • [WAU20] Zhikang T Wang, Yuto Ashida and Masahito Ueda “Deep reinforcement learning control of quantum cartpoles” In Phys. Rev. Lett. 125.10, 2020, pp. 100401 DOI: 10.1103/PhysRevLett.125.100401
  • [WD92] Christopher J.C.H. Watkins and Peter Dayan “Q-learning” In Mach. Learn. 8.3, 1992, pp. 279–292 DOI: 10.1007/BF00992698
  • [Wei+21] Qing Wei, Hailan Ma, Chunlin Chen and Daoyi Dong “Deep Reinforcement Learning With Quantum-Inspired Experience Replay” In IEEE Trans. Cybern. 52.9, 2021, pp. 9326–9338 DOI: 10.1109/TCYB.2021.3053414
  • [Wie+22] Simon Wiedemann, Daniel Hein, Steffen Udluft and Christian Mendl “Quantum Policy Iteration via Amplitude Estimation and Grover Search – Towards Quantum Advantage for Reinforcement Learning” In arXiv:2206.04741, 2022 DOI: 10.48550/arXiv.2206.04741
  • [Wie+22a] David Wierichs, Josh Izaac, Cody Wang and Cedric Yen-Yu Lin “General parameter-shift rules for quantum gradients” In Quantum 6, 2022, pp. 677 DOI: 10.22331/q-2022-03-30-677
  • [Wie+23] Marco Wiedmann et al. “An Empirical Comparison of Optimizers for Quantum Machine Learning with SPSA-based Gradients” In IEEE International Conference on Quantum Computing and Engineering (QCE) 1, 2023, pp. 450–456 DOI: 10.1109/QCE57702.2023.00058
  • [Wie21] Simon Wiedemann “Modelling and Solving Reinforcement Learning Problems on Quantum Computers”, 2021
  • [Wil92] Ronald J. Williams “Simple statistical gradient-following algorithms for connectionist reinforcement learning” In Mach. Learn. 8.3, 1992, pp. 229–256 DOI: 10.1007/BF00992696
  • [Wu+23] Shaojun Wu, Shan Jin, Dingding Wen and Xiaoting Wang “Quantum reinforcement learning in continuous action space” In arXiv:2012.10711, 2023 DOI: 10.48550/arXiv.2012.10711
  • [Yan+22] Rudai Yan, Yu Wang, Yan Xu and Jiahong Dai “A multiagent quantum deep reinforcement learning method for distributed frequency control of islanded microgrids” In IEEE Trans. Control Netw. Syst. 9.4, 2022, pp. 1622–1632 DOI: 10.1109/TCNS.2022.3140702
  • [Yan23] Junzheng Yang “Apply Deep Reinforcement Learning with Quantum Computing on the Pricing of American Options” In Internet Finance and Digital Economy, 2023, pp. 675–694 DOI: 10.1142/9789811267505_0050
  • [Yin+21] Linfei Yin et al. “Quantum deep reinforcement learning for rotor side converter control of double-fed induction generator-based wind turbines” In Engineering Applications of Artificial Intelligence 106, 2021, pp. 104451 DOI: 10.1016/j.engappai.2021.104451
  • [YN06] J.Q. You and Franco Nori “Superconducting Circuits and Quantum Information” In Phys. Today 58.11, 2006, pp. 42 DOI: 10.1063/1.2155757
  • [YPK23] Won Joon Yun, Jihong Park and Joongheon Kim “Quantum multi-agent meta reinforcement learning” In Proceedings of the AAAI Conference on Artificial Intelligence 37.9, 2023, pp. 11087–11095 DOI: 10.1609/aaai.v37i9.26313
  • [Yun+22] Won Joon Yun et al. “Quantum multi-agent reinforcement learning via variational quantum circuit design” In 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), 2022, pp. 1332–1335 DOI: 10.1109/ICDCS54860.2022.00151
  • [Yun+23] Won Joon Yun et al. “Quantum Multi-Agent Actor-Critic Neural Networks for Internet-Connected Multi-Robot Coordination in Smart Factory Management” In IEEE Internet Things J. 10.11, 2023, pp. 9942–9952 DOI: 10.1109/JIOT.2023.3234911
  • [Zha+11] Tingting Zhao, Hirotaka Hachiya, Gang Niu and Masashi Sugiyama “Analysis and Improvement of Policy Gradient Estimation” In Adv. Neural Inf. Process. Syst. 24, 2011, pp. 118–129 DOI: 10.1016/j.neunet.2011.09.005
  • [Zha+19] Xiao-Ming Zhang et al. “When does reinforcement learning stand out in quantum control? A comparative study on state preparation” In NPJ Quantum Inf. 5.1, 2019, pp. 85 DOI: 10.1038/s41534-019-0201-8
  • [Zha+22] Shi-Xin Zhang, Chang-Yu Hsieh, Shengyu Zhang and Hong Yao “Differentiable quantum architecture search” In Quantum Sci. Technol. 7.4, 2022, pp. 045023 DOI: 10.1088/2058-9565/ac87cd
  • [Zho+23] Han Zhong et al. “Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret” In arXiv:2302.10796, 2023 DOI: 10.48550/arXiv.2302.10796
  • [Zie+08] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell and Anind K. Dey “Maximum entropy inverse reinforcement learning” In AAAI 8, 2008, pp. 1433–1438 URL: https://www.aaai.org/Papers/AAAI/2008/AAAI08-227.pdf
  • [ZY23] Jun Zhao and Wenhan Yu “Quantum Multi-Agent Reinforcement Learning as an Emerging AI Technology: A Survey and Future Directions” In TechRxiv, 2023 DOI: 10.36227/techrxiv.24563293.v1