A Survey on Quantum Reinforcement Learning

Nico Meyer, Christian Ufrecht, Maniraman Periyasamy, Daniel D. Scherer, Axel Plinge, and Christopher Mutschler
Fraunhofer IIS, Fraunhofer Institute for Integrated Circuits IIS, Nuremberg, Germany
{firstname.lastname|daniel.scherer2}@iis.fraunhofer.de

(January 1, 2024)

Abstract

Quantum reinforcement learning is an emerging field at the intersection of quantum computing and machine learning. While we intend to provide a broad overview of the literature on quantum reinforcement learning (our interpretation of this term will be clarified below), we put particular emphasis on recent developments. With a focus on already available noisy intermediate-scale quantum devices, these include variational quantum circuits acting as function approximators in an otherwise classical reinforcement learning setting. In addition, we survey quantum reinforcement learning algorithms based on future fault-tolerant hardware, some of which come with a provable quantum advantage. We provide both a bird's-eye view of the field and summaries and reviews for selected parts of the literature.

1 Introduction and Overview

With recent advances in the fabrication and control of hardware for quantum information processing, the possibilities of merging quantum computing (QC) with machine learning (ML) have received considerable attention within the growing research community. Within ML, reinforcement learning (RL) is the third paradigm besides supervised and unsupervised learning. In this survey article, we provide an overview of so-called quantum reinforcement learning (QRL) algorithms. We understand these as quantum-assisted approaches that solve a particular task (be it classical or quantum in nature) by employing quantum resources (in simulation and/or in experiment).

In order to keep this contribution as self-contained as possible, we provide the necessary background before venturing into the QRL literature. We start out with a brief recap of the essentials of the RL paradigm in the fully classical setting in Sec. 2. Further, in Sec. 3 we provide a quick introduction to QC and variational quantum circuits (VQCs). Readers familiar with either of the topics may safely skip the corresponding section.

Figure 1: A possible classification matrix for QRL algorithms, where we took into account only those variants of QRL which we focus on in Sec. 4. The algorithm classes are ordered according to their degree of quantum-classical hybridization, ranging from purely classical to purely quantum. A more detailed review of the $22$ selected works on quantum-inspired RL (QiRL) algorithms can be found in Sec. 4.1. VQC-based approaches, comprising $68$ papers, are summarized in some detail in Sec. 4.2. QRL algorithms employing post-NISQ quantum algorithms as subroutines, or even fully quantum approaches to QRL, are described in Sec. 4.3, Sec. 4.4, Sec. 4.5 and Sec. 4.6, based on $30$ selected manuscripts. The dashed vertical line between classical and NISQ compute resources indicates that presently it is unclear whether QRL with NISQ-compatible algorithms offers robust quantum advantage on a broad range of learning problems. The solid vertical line distinguishes post-NISQ algorithms from both classical and NISQ-compatible algorithms, as they typically come with guaranteed quantum advantage (at least relative to their classical counterparts).

In Sec. 4 we turn our attention to the emerging field of QRL, starting out with a quick overview of the literature. Then we delve into summaries of the most prominent contributions. This selection is necessarily subjective and reflects our own research interests: overall we identified $177$ relevant manuscripts, of which we reviewed $120$ explicitly. For a detailed overview of paper counts, see the table of publication counts below. We organized our summaries into several blocks that are ordered by what one could call an increasing degree of ‘quantization’. The first of these blocks in Sec. 4.1 covers what we refer to as ‘quantum-inspired’ RL algorithms. The second block in Sec. 4.2 takes a rather detailed look at QRL algorithms that employ so-called VQCs as function approximators. In many cases, the corresponding algorithms are obtained by simply replacing a standard neural network function approximator (or any other sort) by an appropriate VQC. We provide detailed summaries for most papers in this category, as variational quantum algorithms are believed to offer the potential to obtain quantum advantage despite the limitations of present-day NISQ hardware. In Secs. 4.3 and 4.4, we take a look at realizations of QRL based on so-called projective simulation and the use of Boltzmann machines as function approximators, respectively. In Sec. 4.5 we move to a class of approaches that employ quantum algorithms as subroutines. The corresponding hardware requirements will likely be compatible only with universal, fault-tolerant and error-corrected quantum processing units (QPUs). Finally, Sec. 4.6 provides a summary of a formal approach to QRL, which treats all components of RL ‘quantumly’. From our point of view, the highest degree of quantization can thus be found in these approaches. Fig. 1 gives an overview of the QRL literature as understood in this survey.

Category	$\leq 2018$	$2019$	$2020$	$2021$	$2022$	$2023$	$\Sigma$
Quantum-inspired QRL	$12$	$1$	$2$	$5$	$2$	$0$	$22$
VQC-based QRL	$0$	$2$	$2$	$9$	$21$	$34$	$68$
QRL application ${}^{\boldsymbol{\mathrm{a}}}$	$0$	$0$	$0$	$1$	$7$	$18$	$26$
Post-NISQ QRL	$12$	$2$	$2$	$6$	$4$	$4$	$30$

Finally, in Sec. 5 we state our concluding thoughts on the current state-of-the-art of QRL. Before moving to more technical content, we would like to express our hope that this literature survey on QRL will be of use to colleagues and collaborators and the wider QC research community. It represents our effort to familiarize ourselves with QRL and its main research directions.

2 Classical Reinforcement Learning

Compared to the methods of supervised and unsupervised learning, which are typically implemented as passive learning, RL falls into the class of interaction-based learning [SB18]. On an abstract level, the learner interacts with its environment, the state of which it can either fully or only partially observe through a corresponding observation obtained after executing an action according to an underlying policy. In the RL paradigm, the learner is therefore appropriately referred to as an agent: it can - be it in simulation or in the real world - interact with its environment according to its abilities. The aim of RL is to learn a policy through the interaction of the agent with the environment, which is optimal with regard to a reward adapted to the problem. In other words, the agent should find an optimal policy during the learning process in the abstract space of all policies, which maximizes the expected cumulative reward. The theoretical basis for RL is formed by so-called Markov decision processes (MDPs) and the associated Bellman equation, which represents a consistency equation for the so-called value function. In turn, an optimal policy can be extracted from the optimal value function. Alternatively, the optimal policy can also be learned directly. Under certain conditions, the elements of RL can be mapped to their respective equivalents in control theory, where typically a dynamic optimization problem is solved by gradient-based methods with simulation of the corresponding model dynamics. On the RL side, there are both model-based and model-free approaches. The model-free approach in particular is one of the strengths of the RL method, since in many cases state and action spaces are too high-dimensional to design realistic dynamical models and simulate them to efficiently find optimal control strategies. The large dimensions of the spaces that occur in realistic problems make the use of approximation methods for the value function necessary. 
Driven by the breakthroughs in deep learning (DL), artificial neural networks (NNs) have established themselves as function approximators for both value function and policy (understood as a deterministic or probabilistic mapping of states to actions), thus establishing the field of deep reinforcement learning (DRL).

In the following, we introduce the various notions pertaining to RL in a more formal way and provide the background necessary to understand the basic RL terminology. In an RL scenario, the algorithm, also referred to as the agent, generates its own data by interacting with an environment. This interaction happens over discrete timesteps $t$, which are grouped into episodes with either finite or infinite horizon. In each timestep, the agent makes an observation $s_{t}\in\mathcal{S}$ of the environment. Based on this state information, an action $a_{t}\in\mathcal{A}$ acting on the environment is selected according to a policy. Based on the (usually unknown) environment dynamics, the next state $s_{t+1}\in\mathcal{S}$ is observed from the environment and the agent receives a reward $r_{t}\in\mathcal{R}$ for its choice. The agent should select its actions in such a way that some objective is optimized, usually related to the long-term reward. A sketch of this pipeline can be found in Fig. 2. In this survey article, we follow the formalism and notation of Sutton and Barto [SB18], with small adaptations wherever we feel that it eases comprehension.

Reinforcement Learning as a Markov Decision Process

More formally, this setup is usually described as an MDP. A finite MDP is a 5-tuple $(\mathcal{S},\mathcal{A},\mathcal{R},p,\gamma)$ , where the sets $\mathcal{S}$ , $\mathcal{A}$ and $\mathcal{R}$ are finite. It is defined by the following components:

• A set of states $\mathcal{S}$ that the agent can observe from the environment.
• A set of actions $\mathcal{A}$ that the agent can execute in the environment.
• A set of rewards $\mathcal{R}\subset\mathbb{R}$ that the agent can receive from the environment.
• The environment dynamics $p:\mathcal{S}\times\mathcal{R}\times\mathcal{S}\times\mathcal{A}\to[0,1]$. The value $p(s^{\prime},r|s,a):=\text{Pr}\{s_{t+1}=s^{\prime},r_{t}=r|s_{t}=s,a_{t}=a\}$ gives the probability that the environment transitions to state $s^{\prime}$ and the agent receives reward $r$ if the agent executes action $a$ in state $s$ at time $t$.
• The discount factor $0\leq\gamma\leq 1$, which is discussed in more detail below.

The dynamics of the environment are often not accessible to the agent; otherwise the task reduces to (not necessarily trivial) dynamic programming. For every fixed state-action pair, the function $p$ is a probability distribution over next states and rewards, i.e., $\sum_{s^{\prime}\in\mathcal{S},r\in\mathcal{R}}p(s^{\prime},r|s,a)=1$ for all choices of $s\in\mathcal{S}$ and $a\in\mathcal{A}$. According to the Markov property, the dynamics are completely described by $p$, i.e., the consecutive state $s_{t+1}$ and reward $r_{t}$ depend solely on the directly preceding state $s_{t}$ and action $a_{t}$.

With this framework in mind, the interaction between agent and environment can be described as a trajectory $\tau$ . For a finite or infinite horizon $H$ , one episode is therefore given by the sequence

\tau=\left[s_{0},a_{0},r_{0},s_{1},a_{1},r_{1},s_{2},\cdots,s_{H-1},a_{H-1},r_{H-1}\right],

(1)

with $s_{t}\in\mathcal{S}$ , $a_{t}\in\mathcal{A}$ , and $r_{t}$ sampled following the environment dynamics for each timestep $t$ .
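The definitions above can be made concrete with a minimal sketch. The two-state, two-action MDP below is a hypothetical toy example (it does not appear in the surveyed literature), and the names `P`, `is_normalized` and `sample_trajectory` are chosen here for illustration only. The sketch stores $p(s^{\prime},r|s,a)$ as a lookup table, checks the normalization condition, and rolls out one trajectory in the layout of Eq. 1.

```python
import random

# Hypothetical two-state, two-action MDP, used only for illustration.
# P[(s, a)] lists the outcomes (s_next, r, probability) of p(s_next, r | s, a).
P = {
    (0, 0): [(0, 0.0, 0.9), (1, 1.0, 0.1)],
    (0, 1): [(0, 0.0, 0.2), (1, 1.0, 0.8)],
    (1, 0): [(0, 0.0, 1.0)],
    (1, 1): [(1, 2.0, 0.5), (0, 0.0, 0.5)],
}
ACTIONS = [0, 1]


def is_normalized(dynamics, tol=1e-12):
    """Check that p(., . | s, a) sums to one for every state-action pair."""
    return all(abs(sum(prob for _, _, prob in outcomes) - 1.0) < tol
               for outcomes in dynamics.values())


def uniform_policy(s, rng):
    """pi(a | s) = 1 / |A| for every state."""
    return rng.choice(ACTIONS)


def sample_trajectory(dynamics, policy, s0, horizon, rng):
    """Roll out tau = [s_0, a_0, r_0, ..., s_{H-1}, a_{H-1}, r_{H-1}]."""
    tau, s = [], s0
    for _ in range(horizon):
        a = policy(s, rng)
        outcomes = dynamics[(s, a)]
        weights = [prob for _, _, prob in outcomes]
        # Draw (s_next, r) jointly, as prescribed by the dynamics p.
        s_next, r, _ = rng.choices(outcomes, weights=weights, k=1)[0]
        tau.extend([s, a, r])
        s = s_next
    return tau


rng = random.Random(0)
tau = sample_trajectory(P, uniform_policy, s0=0, horizon=5, rng=rng)
print(is_normalized(P), tau)
```

Because the next state depends only on the current state and action, the rollout loop needs no memory beyond `s`, which is exactly the Markov property stated above.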

Long Term Reward as Objective

The agent gets feedback from the environment through the immediate rewards $r_{t}$. However, instead of maximizing these short-term rewards, it is more appropriate to use a long-term measure as the objective. A natural choice is the cumulative reward, also referred to as the return

G_{t}:=r_{t}+r_{t+1}+r_{t+2}+\cdots+r_{H-1}.

(2)

For episodic tasks ($H<\infty$) it is often desirable, and for continuing tasks ($H=\infty$) it is necessary, to use a discount factor $\gamma$. This leads to the discounted return

G_{t}:=\sum\limits_{t^{\prime}=t}^{H-1}\gamma^{t^{\prime}-t}\cdot r_{t^{\prime}},

(3)

where each choice of $\gamma$ defines a different MDP. For $\gamma<1$ and bounded rewards, the value of $G_{t}$ is guaranteed to be finite, and the weight of individual rewards decreases with their distance from the current timestep. In the extreme case $\gamma=0$ the sum reduces to just the immediate reward, so an appropriate choice of this hyperparameter is crucial for the potential success of the RL agent.
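As a small illustration of Eqs. 2 and 3, the following sketch computes the discounted return of a finite reward sequence. The function name `discounted_return` and the example rewards are assumptions made here; the backward accumulation uses the recursion $G_{t}=r_{t}+\gamma G_{t+1}$, which follows directly from Eq. 3.

```python
def discounted_return(rewards, gamma, t=0):
    """G_t = sum_{t'=t}^{H-1} gamma^(t'-t) * r_{t'} for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards[t:]):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=1.0))  # undiscounted sum, as in Eq. 2
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25
print(discounted_return(rewards, gamma=0.0))  # immediate reward only
```

The three calls cover the regimes discussed above: no discounting, intermediate discounting, and the myopic limit $\gamma=0$.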

Policy, Value Functions and Optimality

In order to describe a meaningful RL setup, there are still some concepts missing. As described above, the agent needs to decide for an action in every timestep, depending on the state information that is observed. This decision making process can be understood as a (stochastic) policy

\pi\left(a|s\right):=\text{Pr}\{a_{t}=a|s_{t}=s\},

(4)

where $\sum_{a\in\mathcal{A}}\pi\left(a|s\right)=1$ holds for all $s\in\mathcal{S}$. The overall task of RL is to derive an optimal policy $\pi^{*}$ w.r.t. some metric.

A suitable tool to define optimality and also to simplify updates is the notion of value functions. The state value function of state $s$ under the current policy $\pi$ is defined as

V_{\pi}(s):=\mathbb{E}_{\pi}\left[G_{t}|s_{t}=s\right].

(5)

It describes the expected returns when starting in state $s$ and following policy $\pi$ from there on, with the value for a terminal state always zero by definition. It can be interpreted as a measure of how good it is to be in a certain state, where quality is measured w.r.t. expected return. Explicitly separating the first step in the definition above gives rise to the Bellman (expectation) equation

V_{\pi}(s)=\sum\limits_{a\in\mathcal{A}}\pi\left(a|s\right)\sum\limits_{s^{\prime}\in\mathcal{S},r\in\mathcal{R}}p\left(s^{\prime},r|s,a\right)\left[r+\gamma\cdot V_{\pi}(s^{\prime})\right],

(6)

for all $s\in\mathcal{S}$ . Consequently, the value function $V_{\pi}$ can be viewed as the unique solution to this Bellman equation. Alternatively, one can define the state-action value function as the expected return when starting in state $s$ , executing action $a$ , and following policy $\pi$ from there on. It is defined as

Q_{\pi}(s,a):=\mathbb{E}_{\pi}\left[G_{t}|s_{t}=s,a_{t}=a\right],

(7)

for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$. It is straightforward to see that $V_{\pi}(s)=\sum_{a\in\mathcal{A}}\pi\left(a|s\right)Q_{\pi}(s,a)$ holds for all $s\in\mathcal{S}$. This identity can be used to give the Bellman equation for the state-action value function as $Q_{\pi}(s,a)=\sum_{s^{\prime}\in\mathcal{S},r\in\mathcal{R}}p(s^{\prime},r|s,a)\left[r+\gamma\cdot\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s^{\prime})Q_{\pi}(s^{\prime},a^{\prime})\right]$.
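To illustrate how the Bellman expectation equation (Eq. 6) characterizes $V_{\pi}$, the sketch below applies its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The toy MDP, the uniform policy and all names (`P`, `q_from_v`, `evaluate_policy`) are hypothetical choices made for this example, and the discount factor $\gamma=0.9$ is assumed.

```python
# Hypothetical two-state, two-action MDP, used only for illustration.
# P[(s, a)] lists the outcomes (s_next, r, probability) of p(s_next, r | s, a).
P = {
    (0, 0): [(0, 0.0, 0.9), (1, 1.0, 0.1)],
    (0, 1): [(0, 0.0, 0.2), (1, 1.0, 0.8)],
    (1, 0): [(0, 0.0, 1.0)],
    (1, 1): [(1, 2.0, 0.5), (0, 0.0, 0.5)],
}
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9


def q_from_v(V, s, a):
    """One-step lookahead: sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]."""
    return sum(prob * (r + GAMMA * V[s_next]) for s_next, r, prob in P[(s, a)])


def evaluate_policy(pi, tol=1e-10):
    """Iterate the Bellman expectation equation until V stops changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: sum(pi[s][a] * q_from_v(V, s, a) for a in ACTIONS)
                 for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new


# Uniformly random policy pi(a | s) = 1/2.
pi_uniform = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
V_pi = evaluate_policy(pi_uniform)
Q_pi = {(s, a): q_from_v(V_pi, s, a) for s in STATES for a in ACTIONS}
print(V_pi)
print(Q_pi)
```

The dictionary `Q_pi` is obtained from `V_pi` by a single lookahead, so the identity $V_{\pi}(s)=\sum_{a}\pi(a|s)Q_{\pi}(s,a)$ can be checked numerically on the result.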

The value function makes it possible to explicitly define and evaluate the quality of policies, i.e., the policy $\pi$ is better than or equal to another policy $\pi^{\prime}$ iff $V_{\pi}(s)\geq V_{\pi^{\prime}}(s)$ for all $s\in\mathcal{S}$. If a policy is better than or equal to all others, it is considered an optimal policy $\pi^{*}$. All optimal policies share the same optimal state-value function

V_{\pi^{*}}(s):=V^{*}(s):=\underset{\pi}{\max}\,V_{\pi}(s),

(8)

for all $s\in\mathcal{S}$ . A similar notion of optimality for the action-value function is given by

Q^{*}(s,a):=\underset{\pi}{\max}\,Q_{\pi}(s,a),

(9)

for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$. It is straightforward to formulate the connection of both quantities as $V^{*}(s)=\underset{\pi}{\max}\,\left(\sum_{a\in\mathcal{A}}\pi\left(a|s\right)Q_{\pi}(s,a)\right)=\underset{a\in\mathcal{A}}{\max}\,Q^{*}(s,a)$. With this it is possible to derive the Bellman optimality equation for the value function as

V^{*}(s)=\underset{a\in\mathcal{A}}{\max}\,\sum\limits_{s^{\prime}\in\mathcal{S},r\in\mathcal{R}}p\left(s^{\prime},r|s,a\right)\left[r+\gamma\cdot V^{*}(s^{\prime})\right],

(10)

for all $s\in\mathcal{S}$. Using the stated connection, this can be reformulated to extend to the state-action value function as $Q^{*}(s,a)=\sum_{s^{\prime}\in\mathcal{S},r\in\mathcal{R}}p(s^{\prime},r|s,a)\left[r+\gamma\cdot\underset{a^{\prime}\in\mathcal{A}}{\max}\,Q^{*}(s^{\prime},a^{\prime})\right]$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$.

Solving and Approximating the Bellman Equation

One topic that has to be addressed is the actual representation of the policy and value functions. The most intuitive approach is to store the values for all state-action pairs in a table, also referred to as the tabular approach. While this formulation offers convergence and optimality guarantees for several scenarios, it has some serious drawbacks. Most prominently, it becomes intractable once the state-action space gets too large, which is the case for most real-world problems. A workaround is to use parametric function approximators, which results in the parameterized functions $\pi_{\theta}$, $V_{\theta}$, or $Q_{\theta}$, respectively. The typical choice is an NN [HSW89]; in Sec. 4.2 the usage of VQCs for this task is considered from several angles. As the defining quantities are now only approximated, convergence guarantees are much less straightforward than in the tabular case. The remaining parts of this section can be understood both for the tabular and the parameterized case, although details might vary.

The Bellman optimality equation offers a tool to derive an optimal policy. It has to be noted that the given formulation makes use of the environment dynamics $p$ . Therefore, solution methods solving the equation with dynamic programming are referred to as model-based. The two most prominent examples include value iteration [Bel57] and policy iteration [RN94, PRD96].
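As a model-based illustration, the following sketch runs value iteration, i.e., it repeatedly applies the right-hand side of the Bellman optimality equation (Eq. 10) and then reads off a greedy policy. It assumes full access to the dynamics $p$ of a hypothetical toy MDP and a discount factor $\gamma=0.9$; the names `P`, `lookahead` and `value_iteration` are illustrative only.

```python
# Hypothetical two-state, two-action MDP, used only for illustration.
# P[(s, a)] lists the outcomes (s_next, r, probability) of p(s_next, r | s, a).
P = {
    (0, 0): [(0, 0.0, 0.9), (1, 1.0, 0.1)],
    (0, 1): [(0, 0.0, 0.2), (1, 1.0, 0.8)],
    (1, 0): [(0, 0.0, 1.0)],
    (1, 1): [(1, 2.0, 0.5), (0, 0.0, 0.5)],
}
STATES, ACTIONS, GAMMA = [0, 1], [0, 1], 0.9


def lookahead(V, s, a):
    """sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]; requires the model p."""
    return sum(prob * (r + GAMMA * V[s_next]) for s_next, r, prob in P[(s, a)])


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until the values converge."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: max(lookahead(V, s, a) for a in ACTIONS) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            break
        V = V_new
    # A greedy policy w.r.t. the converged values is optimal.
    policy = {s: max(ACTIONS, key=lambda a: lookahead(V_new, s, a))
              for s in STATES}
    return V_new, policy


V_star, pi_star = value_iteration()
print(V_star, pi_star)
```

The function `lookahead` is the only place where the dynamics enter, which makes explicit why this family of methods is called model-based.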

There is also a whole range of model-free approaches, where the agent does not make use of any model that represents the environment dynamics. Instead, all information is directly acquired by interaction with the environment. One prominent representative is the $Q$ -learning approach [WD92], which basically is an approximation of $Q$ -value iteration using samples. Starting with a random initialization, the update rule

Q(s,a)\leftarrow Q(s,a)+\alpha\left(r_{t}+\gamma\cdot\underset{a^{\prime}\in\mathcal{A}}{\max}\,Q(s^{\prime},a^{\prime})-Q(s,a)\right)

(11)

derives directly from the Bellman optimality equation for $Q^{*}$, where $\alpha$ is a learning rate hyperparameter and $(s,a,r_{t},s^{\prime})$ denotes a sampled transition. The policy is usually defined to act $\varepsilon$-greedily w.r.t. the current action-value function, i.e.,

\pi(s):=\begin{cases}\underset{a\in\mathcal{A}}{\arg\max}\,Q(s,a)&\text{with probability }1-\varepsilon,\\ \text{uniformly at random from }\mathcal{A}&\text{with probability }\varepsilon.\end{cases}

(12)
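The update rule of Eq. 11 together with the $\varepsilon$-greedy policy of Eq. 12 can be sketched in a few lines of tabular code. The deterministic two-state environment `STEP`, the zero initialization (instead of a random one) and all hyperparameter values are assumptions made for this illustration; the agent only sees sampled transitions and never reads the dynamics directly.

```python
import random

# Hypothetical deterministic two-state environment, used only for illustration:
# STEP[(s, a)] = (s_next, r). The agent only observes sampled transitions.
STEP = {(0, 0): (0, 0.0), (0, 1): (1, 1.0), (1, 0): (0, 0.0), (1, 1): (1, 2.0)}
STATES, ACTIONS = [0, 1], [0, 1]


def epsilon_greedy(Q, s, epsilon, rng):
    """Eq. 12: explore uniformly with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])


def q_learning(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on one long stream of interaction."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}  # zero initialization
    s = 0
    for _ in range(steps):
        a = epsilon_greedy(Q, s, epsilon, rng)
        s_next, r = STEP[(s, a)]  # interaction with the environment
        target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])  # Eq. 11
        s = s_next
    return Q


Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(Q, greedy)
```

For this toy environment the optimal behavior is to move to state 1 and stay there, which yields $Q^{*}(1,1)=2/(1-\gamma)=20$ for $\gamma=0.9$; the learned table can be compared against this value.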

An alternative approach is the policy gradient idea [Sut+99], which directly aims to learn the policy. Based on a parameterized policy $\pi_{\theta}$, it performs updates

\theta\leftarrow\theta+\alpha\nabla_{\theta}J(\theta)

(13)

via gradient ascent, where $J(\theta)$ is a performance measure, usually $J(\theta)=V_{\pi_{\theta}}(s_{0})$. Unfortunately, the desired gradient depends on the environment dynamics, which are typically unknown. The policy gradient theorem [Sut+99] describes a quantity proportional to $\nabla_{\theta}V_{\pi_{\theta}}$, which is easier to obtain. It is given by

\nabla_{\theta}V_{\pi_{\theta}}(s_{0})\propto\sum\limits_{s\in\mathcal{S}}\mu(s)\sum\limits_{a\in\mathcal{A}}Q_{\pi_{\theta}}(s,a)\nabla_{\theta}\pi_{\theta}\left(a|s\right),

(14)

where $\mu(s)$ denotes the fraction of time that is spent in state $s$ when following $\pi_{\theta}$. A concrete instance of this idea is the REINFORCE algorithm [Wil92], where a Monte Carlo method is used to estimate the quantity described in the equation above. Furthermore, the training procedure can be stabilized by introducing a suitable baseline function that reduces the variance of the gradient estimate [Zha+11].
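A minimal sketch of the REINFORCE idea is given below for the simplest possible setting, a hypothetical single-state, single-step task (a two-armed bandit), so that the Monte Carlo return is just the sampled reward. The softmax parameterization, the learning rate and all names are assumptions of this example and no baseline is used. The update relies on the identity $\nabla_{\theta}\pi_{\theta}=\pi_{\theta}\nabla_{\theta}\log\pi_{\theta}$, which turns the sum over actions in Eq. 14 into an expectation that can be estimated from samples.

```python
import math
import random

# Hypothetical single-state, single-step task (a two-armed bandit):
# action 1 yields return 1, action 0 yields return 0.
RETURNS = [0.0, 1.0]


def softmax(theta):
    """Softmax policy pi_theta(a) over the action preferences theta."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce(episodes=500, alpha=0.5, seed=0):
    """Monte Carlo policy gradient without baseline."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(episodes):
        pi = softmax(theta)
        a = rng.choices([0, 1], weights=pi, k=1)[0]  # sample a ~ pi_theta
        G = RETURNS[a]                               # Monte Carlo return
        for b in range(len(theta)):
            # For a softmax policy: d/d theta_b log pi(a) = 1[a == b] - pi(b).
            grad_log = (1.0 if b == a else 0.0) - pi[b]
            theta[b] += alpha * G * grad_log         # sampled version of Eq. 13
    return theta


theta = reinforce()
print(softmax(theta))
```

In multi-step tasks the same update is applied at every timestep of a sampled episode with the corresponding return $G_{t}$; subtracting a baseline from `G` leaves the expected update unchanged while reducing its variance.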

Overall, there are several extensions and modifications of the described concepts. One method worth mentioning is the actor-critic approach [KT03], which combines ideas from policy gradient and value function methods. As for smaller modifications, there is double $Q$-learning, which introduces an additional action-value function to reduce the overestimation bias caused by the standard $Q$-learning procedure [Has10]. Similarly, the introduction of an experience replay buffer [Lin92] is intended to improve stability and sample efficiency. This finally leads to offline or batch RL [EGW05], where the agent is not allowed to directly interact with the environment. Instead, it only has access to a set of previously collected experiences. This formulation is especially relevant in practice, as generating data is sometimes quite expensive. There is still a wide range of topics this summary did not touch. Where necessary, additional details will be introduced in the upcoming chapters. For a broader introduction to the topic one can refer to Ref. [SB18]; more recent developments are reviewed, e.g., in Refs. [Aru+17, NLH20].

3 The Quantum Computing Paradigm

The foundations of QC were established at the beginning of the 20th century, when the modern theory of quantum physics was developed. Benioff and Feynman proposed the idea of taking advantage of quantum mechanical systems for computing in the early 1980s [Ben80, Fey82]. QC challenges the strong Church-Turing hypothesis, as it potentially provides efficient solutions to classically intractable problems [NL16]. This section gives a pragmatic introduction to the basics of QC, and also provides an extension to quantum machine learning (QML) (here understood as ML with VQCs as a new class of models) with a focus on QRL.

Single and Multi-Qubit Systems

Similar to RL, notation and conventions regarding quantum computing vary quite a bit throughout the literature. Regarding notation, we closely follow the textbook by Nielsen and Chuang [NL16].

For the moment, let us consider the basic unit of information for classical information processing. A single bit is either in state $0$ or state $1$. Consequently, a sequence of $n$ bits can represent $2^{n}$ unique values. The bit register can, however, only be in one of these $2^{n}$ states at any point in time.

A qubit is the quantum version of a bit. We use the Dirac notation [NL16] to define $\ket{0}$ and $\ket{1}$ as two distinct, orthogonal states of the qubit system. These basis states span a $2$ -dimensional Hilbert space $\mathcal{H}\cong\mathbb{C}^{2}$ , which contains all $1$ -qubit (pure) quantum states. The qubits are subject to the laws of quantum mechanics and can be realized with, e.g., spin systems of subatomic particles [PMV02], ion traps [BBA14], neutral atoms [SWM10], or superconducting circuits [YN06]. This gives rise to some interesting properties. In fact, a qubit can not only be in either state $\ket{0}$ or $\ket{1}$ , but in a superposition of both. An arbitrary $1$ -qubit state is given as

\displaystyle\ket{\psi}=\alpha\ket{0}+\beta\ket{1}.

(15)

The amplitudes $\alpha$ and $\beta$ are complex numbers, which must satisfy $\absolutevalue{\alpha}^{2}+\absolutevalue{\beta}^{2}=1$ . To get a nice visual representation, Eq. 15 can be reformulated as

\displaystyle\ket{\psi}=e^{i\gamma}\left(\cos\frac{\theta}{2}\ket{0}+e^{i\phi}\sin\frac{\theta}{2}\ket{1}\right),

(16)

with $\gamma,\,\theta,\,\phi\in\mathbb{R}$. As any global phase has no observable effect [NL16], the prefactor $e^{i\gamma}$ in Eq. 16 can be omitted. This representation makes it possible to visualize the state of a $1$-qubit system on the surface of the Bloch sphere, see Fig. 3. The north and south poles w.r.t. the $z$-axis correspond to the basis states $\ket{0}$ and $\ket{1}$, which are also referred to as computational basis states of a single qubit. Another, less commonly used basis is given by the poles w.r.t. the $x$-axis, whose elements are $\ket{+}=\frac{\ket{0}+\ket{1}}{\sqrt{2}}$ and $\ket{-}=\frac{\ket{0}-\ket{1}}{\sqrt{2}}$. Similarly, one could also use the poles w.r.t. the $y$-axis, $\ket{R}=\frac{\ket{0}+i\ket{1}}{\sqrt{2}}$ and $\ket{L}=\frac{\ket{0}-i\ket{1}}{\sqrt{2}}$. An alternative representation associates quantum states with amplitude vectors:

\displaystyle\ket{0}\to\begin{bmatrix}1\\ 0\end{bmatrix}\text{ and }\ket{1}\to\begin{bmatrix}0\\ 1\end{bmatrix}

(21)
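The Bloch-sphere parameterization of Eq. 16 and the amplitude-vector picture can be connected with a short sketch. The helper names `bloch_state` and `norm_squared` are chosen here for illustration, the global phase $e^{i\gamma}$ is dropped as discussed above, and only the Python standard library is used.

```python
import cmath
import math


def bloch_state(theta, phi):
    """Amplitudes [alpha, beta] of Eq. 16 with the global phase omitted."""
    return [complex(math.cos(theta / 2)),
            cmath.exp(1j * phi) * math.sin(theta / 2)]


def norm_squared(state):
    """|alpha|^2 + |beta|^2, which must equal 1 for a valid state."""
    return sum(abs(c) ** 2 for c in state)


ket0 = bloch_state(0.0, 0.0)                    # north pole, |0>
ket1 = bloch_state(math.pi, 0.0)                # south pole, |1>
plus = bloch_state(math.pi / 2, 0.0)            # |+> on the x-axis
ket_r = bloch_state(math.pi / 2, math.pi / 2)   # |R> on the y-axis
print(plus, ket_r, norm_squared(ket_r))
```

Up to floating-point rounding, the four states reproduce the amplitude vectors of $\ket{0}$, $\ket{1}$, $\ket{+}$ and $\ket{R}$ given in the text.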

Multiple-qubit systems are the point where things get interesting. An $n$ -qubit system gives access to the $2^{n}$ -dimensional Hilbert space, in which an arbitrary pure quantum state is defined as

\ket{\psi}=c_{0}\ket{00\cdots 00}+c_{1}\ket{00\cdots 01}+\cdots+c_{2^{n}-1}\ket{11\cdots 11},

(22)

with $c_{i}\in\mathbb{C}$ and $\sum_{i=0}^{2^{n}-1}\absolutevalue{c_{i}}^{2}=1$. The basis states, e.g. $\ket{00\cdots 01}=\ket{0}\otimes\ket{0}\otimes\cdots\otimes\ket{0}\otimes\ket{1}$, are tensor products of single-qubit basis states. The state $\ket{\psi}\to\left[c_{0},\,c_{1},\,\cdots,\,c_{N-1}\right]^{t}$ possesses $N=2^{n}$ complex amplitudes, whose absolute squared values must sum up to one. Due to the principle of superposition, the description of an $n$-qubit state thus scales as $\mathcal{O}\left(2^{n}\right)$, while a classical $n$-bit register is described by $\mathcal{O}\left(n\right)$ values. Note, however, that a measurement of $n$ qubits still yields only $n$ classical bits.

Evolution of Closed Quantum Systems

In order for computation to be possible, there must be some method to manipulate quantum states. This is achieved by operators acting on the Hilbert space $\mathcal{H}$. By definition, all operators that describe the time evolution of a closed quantum system are reversible. Hence, they can be represented as unitary matrices, i.e., an operator $U$ must satisfy $U^{\dagger}U=I$. This constraint also implies that the norm is preserved, i.e., applying a unitary operator to a quantum state again yields a valid, normalized quantum state of the form of Eq. 22.

In the following, explicit matrix representations of operators are specified in the computational basis. Starting simple, consider the bit-flip operator $\sigma_{x}$ . This operator just flips the amplitudes of the $\ket{0}$ and $\ket{1}$ basis state, on the Bloch sphere this is equivalent to a rotation by $\pi$ about the $x$ -axis. The corresponding operators also exist for $y$ -axis and $z$ -axis, in matrix notation those are given as

\displaystyle X:=\sigma_{x}=\begin{bmatrix}0&1\\ 1&0\end{bmatrix},\quad Y:=\sigma_{y}=\begin{bmatrix}0&-i\\ i&0\end{bmatrix},\quad Z:=\sigma_{z}=\begin{bmatrix}1&0\\ 0&-1\end{bmatrix}.

(29)

Allowing an additional degree of freedom, one can define an operator for a rotation by an arbitrary angle $\theta$ about axis $i$ as

\displaystyle R_{i}(\theta)=e^{-i\frac{\theta}{2}\sigma_{i}},\quad\text{for }i\in\{x,y,z\}.

(30)

The last $1$ -qubit operator we introduce is the Hadamard matrix:

\displaystyle H=\frac{1}{\sqrt{2}}\begin{bmatrix}1&1\\ 1&-1\end{bmatrix},

(33)

which basically performs a change of basis with $H\ket{0}=\ket{+}$ and $H\ket{1}=\ket{-}$ . By employing the tensor product for operators, we can extend $1$ -qubit operators to act on single qubits comprising a multi-qubit system. We now move to genuine multi-qubit operators, acting non-trivially on two or more qubits. For our purposes, the most relevant $2$ -qubit operators are the controlled $X$ ( $CX$ ) and controlled $Z$ ( $CZ$ ), where one qubit acts as the control and the other as the target. More concretely, the $CX$ -gate flips the amplitudes of the target qubit, iff the control is in state $\ket{1}$ . Similar to this, the $CZ$ operator performs a conditional phase flip. The matrix notations are given by

\displaystyle CX=\begin{bmatrix}1&0&0&0\\ 0&1&0&0\\ 0&0&0&1\\ 0&0&1&0\end{bmatrix}\quad\text{and}\quad CZ=\begin{bmatrix}1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&-1\end{bmatrix}.

(42)
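The gates introduced so far can be combined in a small state-vector sketch. It uses plain nested lists instead of a quantum computing library, and the helper names (`matmul`, `kron`, `apply`, `dagger`, `rx`) are assumptions of this example. The rotation gate uses the closed form $R_{x}(\theta)=\cos\frac{\theta}{2}I-i\sin\frac{\theta}{2}X$, which follows from Eq. 30 because $X^{2}=I$. As an example, a Hadamard gate on the first (leftmost) qubit followed by a $CX$ gate maps $\ket{00}$ to the entangled state $(\ket{00}+\ket{11})/\sqrt{2}$.

```python
import math


def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def kron(A, B):
    """Tensor (Kronecker) product of two matrices."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def apply(U, state):
    """Apply the operator U to an amplitude vector."""
    return [sum(U[i][k] * state[k] for k in range(len(state)))
            for i in range(len(U))]


def dagger(U):
    """Conjugate transpose of U."""
    return [[U[j][i].conjugate() for j in range(len(U))]
            for i in range(len(U[0]))]


def rx(theta):
    """R_x(theta) = cos(theta/2) I - i sin(theta/2) X."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * s], [-1j * s, c]]


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
H = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]
CX = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

# |00> -> H on the first qubit -> CX with the first qubit as control.
ket00 = [1, 0, 0, 0]
bell = apply(CX, apply(kron(H, I2), ket00))
print(bell)
print(matmul(dagger(rx(0.7)), rx(0.7)))  # numerically the identity
```

The call `kron(H, I2)` is the tensor-product extension mentioned above: it lets the single-qubit gate $H$ act on the first qubit while leaving the second one untouched.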

Quantum circuit diagrams are a nice way to visualize what is going on in a quantum algorithm. The individual qubits are represented as wires, where the order of operators, also called gates, is defined by their relative position. To be more precise, the top wire gets associated with the leftmost qubit. A few common circuit symbols for the operators introduced so far are depicted in Fig. 4.

Extracting Classical Information via Measurements

In classical computing, it is trivial to observe the exact states of all bits. For quantum systems, in order to extract information, an observable quantity has to be measured. To build the bridge to quantum computing, for each physical observable there exists a Hermitian operator $O$ [NL16], i.e., it holds $O^{\dagger}=O$ . The eigenstates of $O$ define a basis of the quantum system’s Hilbert space.

Once an observable $O$ is measured, the corresponding measurement device outputs an eigenvalue of $O$. The post-measurement state of the system is given by the eigenstate corresponding to the eigenvalue that is measured. The most commonly used observable might be Pauli-$Z$, which corresponds to a measurement in the computational basis for a single qubit, see also Eq. 29. It has eigenvalues $\lambda_{1}=+1$, $\lambda_{2}=-1$ and corresponding eigenstates $v_{1}=\left[1,\,0\right]^{t}$, $v_{2}=\left[0,\,1\right]^{t}$.

The consequences for quantum computing are quite sobering, as observing superpositions w.r.t. the basis defined by the observable is impossible. Rather, one of the postulates of quantum mechanics states the Born rule, which defines a probabilistic relationship between quantum state and measurement output. Let $\ket{0},\,\ket{1},\,\ldots,\,\ket{N-1}$ be the basis defined by observable $O$ and $c_{0},\,c_{1},\,\ldots,\,c_{N-1}$ the corresponding amplitudes of state $\ket{\psi}$ expressed in this basis. Measuring $O$ then results in the measurement outcome $\lambda_{i}$ with probability $\absolutevalue{c_{i}}^{2}$. Consequently, having obtained $\lambda_{i}$, the post-measurement state of the system is $\ket{i}$.
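The Born rule can be illustrated by sampling measurement outcomes from an amplitude vector. The sketch below assumes a measurement in the computational basis with non-degenerate eigenvalues; the function names and the number of shots are illustrative choices. It also shows how the expectation value of Pauli-$Z$ can be estimated from repeated measurements, since a single shot only returns one eigenvalue.

```python
import math
import random


def born_probabilities(state):
    """Probability |c_i|^2 of observing the basis state |i>."""
    return [abs(c) ** 2 for c in state]


def measure(state, rng):
    """Sample an outcome index i; return it with the post-measurement state |i>."""
    probs = born_probabilities(state)
    i = rng.choices(range(len(state)), weights=probs, k=1)[0]
    post = [1.0 if k == i else 0.0 for k in range(len(state))]
    return i, post


def expectation_z(state):
    """Exact <Z> for one qubit: eigenvalue +1 for |0> and -1 for |1>."""
    p0, p1 = born_probabilities(state)
    return p0 - p1


plus = [1 / math.sqrt(2), 1 / math.sqrt(2)]
rng = random.Random(0)
shots = [measure(plus, rng)[0] for _ in range(1000)]
estimate = sum(1 if i == 0 else -1 for i in shots) / len(shots)
print(born_probabilities(plus), expectation_z(plus), estimate)
```

The estimate from a finite number of shots fluctuates around the exact value, with a statistical error that shrinks as the inverse square root of the number of shots.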

The first algorithm claiming provable quantum advantage, i.e., an improvement w.r.t. some complexity metric compared to any classical approach, was published in 1992 by Deutsch and Jozsa [DJ92] for a specially constructed problem. Most famous might be Shor’s algorithm [Sho97], which provides an exponential speedup for tasks like prime factorization. Unfortunately, it requires large-scale, fault-tolerant and error-corrected quantum computers. All current hardware can be considered NISQ devices, which makes the execution of these algorithms infeasible. Despite this, the first claim of experimental quantum advantage was published in 2019 [Aru+19]. Yet, the considered problem was quite far from general practical applicability. A demonstration of quantum advantage on a practically relevant problem is still outstanding. There are some promising candidates like quantum chemistry and material science. Recently, ideas have been put forward on combining quantum computing and machine learning [Ben+20, SP18]. These algorithms are expected to bypass at least some of the problems with execution on presently available NISQ hardware.

Quantum Machine Learning with Variational Quantum Circuits

Research on QML has only really taken off in the last two decades, yet there already exists a variety of approaches. As a rough guide, the hoped-for benefit of QML relies, to a large extent, on the access to the high-dimensional Hilbert space granted by quantum systems. Here, we want to briefly collect the background for the summaries of VQC-based QRL approaches in Sec. 4.2.

QML frequently deals with expectation values of quantum measurements. The expectation value of an observable $O$ w.r.t. the quantum state $\ket{\psi}$ is denoted as

\displaystyle\expectationvalue{O}_{\psi}:=\bra{\psi}O\ket{\psi}.

(43)

While VQCs define a new class of ML models, one can make the case for the loose analogy to NNs, where the relation of in- and output depends on a set of weights. An example for a parameterized quantum operator is given in Eq. 30. The corresponding gate applies a rotation about a specific axis by some angle $\theta_{0}$. Multiple rotation gates form a quantum circuit, where $\boldsymbol{\theta}$ summarizes all free parameters. Varying these values gives the possibility to determine the evolution of the quantum system. Let $U_{\theta}$ denote the corresponding unitary. A schematic example of a VQC is displayed in Fig. 5. Most RL tasks use the concept of states, based on which an informed decision should be taken. This state information is encoded into the quantum system with an appropriate feature map. In general, the inputs $s$ are pre-processed with some mapping function $\Phi$. The results $\Phi(s)$ can be neatly integrated into the quantum circuit via the unitary $U_{\Phi(s)}$. To enhance the expressive power of the VQC, one can use more sophisticated data encoding routines like data re-uploading [Pér+20] or incremental data-uploading [Per+22]. Eventually, some observable has to be measured. A common choice is the computational basis with $O=Z^{\otimes n}$. Overall, the output of the VQC-model can be described as

	$\displaystyle\expectationvalue{O}_{s,\theta}=$	$\displaystyle\bra{0}\left(U_{\theta}U_{\Phi(s)}\right)^{\dagger}OU_{\theta}U_{\Phi(s)}\ket{0}$
	$\displaystyle:=$	$\displaystyle\bra{0}U_{s,\theta}^{\dagger}OU_{s,\theta}\ket{0}.$		(44)
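As a minimal illustration of Eq. 44, the sketch below simulates a single-qubit circuit in plain Python. All concrete choices are assumptions made for illustration only: the feature map is $U_{\Phi(s)}=R_{x}(s)$ with $\Phi$ the identity, the variational part is $U_{\theta}=R_{y}(\theta)$, the observable is $O=Z$, and rotations follow the convention $R_{P}(\alpha)=\exp(-i\alpha P/2)$. For this particular circuit the output reduces to $\cos(s)\cos(\theta)$.

```python
import math


def rx(angle):
    """Matrix of R_x(angle) = exp(-i * angle * X / 2)."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def ry(angle):
    """Matrix of R_y(angle) = exp(-i * angle * Y / 2)."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def apply(gate, state):
    """Apply a 2x2 gate to a single-qubit state vector."""
    return [
        gate[0][0] * state[0] + gate[0][1] * state[1],
        gate[1][0] * state[0] + gate[1][1] * state[1],
    ]


def vqc_output(s, theta):
    """Expectation value <0| U^dagger Z U |0> with U = R_y(theta) R_x(s)."""
    state = [1 + 0j, 0j]             # initial state |0>
    state = apply(rx(s), state)      # feature map U_Phi(s)
    state = apply(ry(theta), state)  # variational unitary U_theta
    # <Z> is the probability of |0> minus the probability of |1>.
    return abs(state[0]) ** 2 - abs(state[1]) ** 2


value = vqc_output(0.3, 0.7)
```

On hardware, this number would not be computed exactly but estimated by averaging the measurement outcomes over many shots.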

For most tasks, this value is post-processed using some function $f$. Keeping things as general as possible, one can define a loss function $\mathcal{L}$ on $f\left(\expectationvalue{O}_{s,\theta}\right)$ (based on the concrete problem at hand). The update of the parameters can be performed using, e.g., gradient-based techniques:

\displaystyle\theta\leftarrow\theta-\alpha\cdot\nabla_{\theta}\mathcal{L}\left(f(\expectationvalue{O}_{s,\theta})\right)

(45)

The required gradient can be obtained using the parameter-shift rule [Cro19, Wie+22a], or SPSA-based approximations [Wie+23].
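The parameter-shift rule and a gradient-based update can be sketched as follows. For a gate of the form $\exp(-i\theta P/2)$ with a Pauli operator $P$, the derivative of the expectation value follows from two evaluations at the shifted parameters $\theta\pm\pi/2$. The expectation value used here, $\cos(s)\cos(\theta)$, is a hypothetical stand-in for a circuit evaluation (the closed form of a one-qubit circuit $R_{y}(\theta)R_{x}(s)$ measured in $Z$). The squared-error loss, the target value, and the step size are likewise assumptions, and the update steps against the gradient so that the loss decreases.

```python
import math


def expectation(s, theta):
    """Stand-in for a circuit evaluation: <Z> for R_y(theta) R_x(s) |0>."""
    return math.cos(s) * math.cos(theta)


def parameter_shift_gradient(s, theta):
    """d<O>/dtheta from two evaluations shifted by +/- pi/2."""
    plus = expectation(s, theta + math.pi / 2)
    minus = expectation(s, theta - math.pi / 2)
    return 0.5 * (plus - minus)


def loss(s, theta, target):
    """Assumed squared-error loss with f chosen as the identity."""
    return (expectation(s, theta) - target) ** 2


def loss_gradient(s, theta, target):
    """Chain rule: dL/dtheta = 2 * (<O> - target) * d<O>/dtheta."""
    error = expectation(s, theta) - target
    return 2.0 * error * parameter_shift_gradient(s, theta)


s, target, alpha = 0.3, 0.0, 0.2  # assumed input, target, and step size
theta = 0.4                       # assumed initial parameter
initial_loss = loss(s, theta, target)
for _ in range(50):
    theta -= alpha * loss_gradient(s, theta, target)  # descent step
final_loss = loss(s, theta, target)
```

Each gradient component costs two circuit evaluations, so a circuit with $m$ such parameters requires $2m$ evaluations per update in this scheme.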

4 Quantum Reinforcement Learning Algorithms

In QML, there are approaches that either aim to stabilize the coherent function of the QPU using ML methods, or use the structure of a hybrid variational algorithm for ML purposes. Very often, RL is used to generate a solution for a quantum control problem, e.g., to learn quantum error correction strategies [Fös+18] or to generate control policies at a lower level [Zha+19, Dal+20]. Other work considers RL as the optimizer of a variational quantum algorithm (VQA) [Kha+19, Kha+20]. While this represents a fascinating research topic in itself, here we will focus on the application of QRL algorithms for solving specific tasks, be they classical or quantum. Research in the field of QML has so far mostly focused on supervised and unsupervised learning. However, the literature already proposes quite a few theoretical concepts and even some small-scale experimental realizations for QRL. Recent developments mostly focus on employing VQCs as function approximators. When transferring from RL to QRL, i.e., the ‘quantization’ of the RL paradigm, there are various possibilities of how quantum computing enters the game. This has led to the development of different QRL variants. A few works review current progress in QRL [KSG21, ML22, Kun22, Lam23, NHP23] and the more general correspondence of RL and QC [ML21]. There is also recent work towards a fair comparison of RL and QRL in restricted settings [MK21, Fra+22].

Quantum-Inspired Approaches. The earliest idea for combining RL with a quantum routine relies on the method of amplitude amplification, as it is used in Grover-type search algorithms [CDC06, Don+08a, Don+06a, Che+06, Don+06, CD08, Don+08, CD10, CFD12, Fak+13, NGC15, Li+20, Nir+21, LAD21, Yin+21, Hu+21, Ren+22, Cho+23]. Several qubit registers embed the states and actions relevant for the RL system in a suitable Hilbert space. Starting from a uniform superposition, amplitudes favored by the reward or the value function are selectively amplified. The action selection is based on Born’s rule, i.e., a measurement is carried out on the qubit register with regard to the ‘action-basis’. The algorithm was also investigated independently of QPUs [Don+12] and recently further developed [GH19]. An introduction to this concept is also provided in Ref. [Raj+21]. As it turns out, these early variants should rather be considered a set of QiRL algorithms that do not offer an intrinsic potential for quantum advantage. Recently, the technique was transferred to sampling from the experience replay buffer in $Q$-learning [Wei+21]. A summary and review of this type of QiRL can be found in Sec. 4.1.

VQC-Based Function Approximation. In DRL, deep neural networks (DNNs) are employed as powerful function approximators. Typically, the approximation either happens in policy space (actor), in value space (critic), or both, resulting in so-called actor-critic approaches. Recently, VQCs were proposed and analyzed in their role as function approximators in the RL setting – an extensive overview is provided in Sec. 4.2. On the one hand, this approach basically replaces a more or less well understood heuristic with a poorly understood heuristic. For the quantum heuristic many open questions regarding computational power, scalability and trainability remain. On the other hand, VQCs nonetheless have spurred the hope for quantum advantage already with NISQ devices. The earliest work in this direction proposed VQC-based approximation in value space, which is covered in Sec. 4.2.1. This so-called VQC-based $Q$-learning was introduced in Ref. [Che+20], and extended in Refs. [LS20, LS21, Lok+22, Che23b, CCC23, FP+23, SJD22, Sko+23, LXJ23]. A method to efficiently evaluate the $Q$-function is discussed in Ref. [San+23], which is however not entirely NISQ-feasible. The complementary approach of approximation in policy space is discussed in Sec. 4.2.2. Originally proposed in Ref. [Jer+21], several extensions have been discussed in Refs. [Kun22, BAQ23, SSB23, Jer+23, Mey21, Mey+23a, Mey+23]. Combinations of value and policy approximation are covered in Sec. 4.2.3, with (soft) actor-critic approaches in Refs. [Wu+23, Kwa+21, Ree23, Che23, Lan21], and multi-agent formulations in Refs. [Yun+22, YPK23]. The setting of offline quantum reinforcement learning is considered in Sec. 4.2.4 by Refs. [Per+23, Che+23]. A collection of algorithmic and conceptual extensions that are relevant for a wide range of approaches is composed in Sec. 4.2.5, based on Refs. [Che23c, Che23a, Kim+21, Hsi+22, Dră+22, Kru+23, SMT23, ACN23, PPR20, Che+22, DS23, Köl+23]. A collection of application-focused work is summarized in Sec. 4.2.6, comprising Refs. [Acu+22, Hei+22, Cob23, BYK22, SMK23, Hic+23, KCP23, Cor+23, San+22, ACN22, Liu+23, Kum+23, Rai+23, SH23, RKM22, Yan+22, Par+23, NS+23, Par+23a, PK23, Yun+23, Ans+23, Che+23b, Yan23, CRC23, Che23d].

Projective Simulation. Another QRL method is based on projective simulation (PS), which in the broadest sense is a particular learning paradigm and similar in spirit to RL [BD12]. Based on experiences made through interaction with the environment, a memory network is created by the agent. The network has a directed structure with adaptive weights between the nodes of the network. The learning process and action selection are based on a random process (more precisely, a random walk) on the graph of the network, with the transition probabilities between nodes being given by the respective adaptive weights. PS can be ‘quantized’ by replacing the random walk with a so-called quantum random walk [Pap+14, Tei21, TRC21, Mel+17]. A formal analysis of convergence properties was given in Ref. [Boy+20]. In fact, there is already work on a proof-of-principle implementation in the laboratory [DFB15, Sri+18] and proposals for quantum-optics implementations [Fla+23]. Possible quantum advantages over classical PS lie in the acceleration of the process of action selection, also referred to as deliberation in the literature. A more detailed summary is provided in Sec. 4.3.

Quantum Boltzmann Machines. Another line of research proposes to use Boltzmann machines as function approximators. These models are assumed to be advantageous compared to typical NNs in environments with large action spaces. Ref. [Jer+21a] demonstrates that Boltzmann machines are closely related to energy-based models. For specific instances, those allow for a quantum representation, which enables a potential quantum speed-up on post-NISQ devices. A similar concept is also proposed for the annealing-based QC paradigm [Cra+18, Sch+22, Lev+17]. A summary of these ideas can be found in Sec. 4.4.

Quantum Subroutines. Another approach to go from RL to QRL replaces certain subroutines in existing RL approaches. One idea is to replace policy or value iteration with some quantum-enhanced analogues. While this approach is limited to universal, fault-tolerant and error-corrected quantum hardware, several such algorithms have been proposed and analyzed [Wie21, Wie+22, Wan+21, CKP23, Gan+23, Zho+23, GA23]. Most importantly, these algorithms come with guarantees regarding speed-up, compared to their classical counterparts. QRL in these settings is often limited to the tabular case and assumes a quantum version of the RL environment, i.e., oracle access. Our summaries and reviews can be found in Sec. 4.5.

Full-Quantum Formulation. An approach which not only ‘quantizes’ certain subroutines, but all components of the pipeline, is considered in Refs. [DTB15, DTB16, DTB17, Dun+18]. Extensions [HDW21, HW22], applications to specific problems [Wan+21a, Wan+23], and small-scale experimental realizations [Sag+21] have been presented. An alternative route to fully quantized QRL was taken in Ref. [Cor18]. For our review of this line of research, see Sec. 4.6.

Various Concepts. For the sake of completeness, we mention different approaches found in the literature. We note, however, that we did not pursue a detailed review for those works, typically because we focused on what we identified as the most considered lines of research. While some of the works listed in the following simply do not fit directly with the learning-based QRL approach, for others it might not seem obvious how to generalize their particular setting to a broader class of problems. While quantum algorithms for dynamic programming have been discussed [Ron19, Amb+19], it currently remains unclear how to move from dynamic programming to a learning-based approach such as RL. Similarly, quantum algorithms have been employed to solve planning tasks [NW05], but again the transfer to a learning-based approach is far from obvious. More closely related to the typical RL setting is the task of imitation learning [Che+23a]. A series of papers discussed QRL in the setting of photonic circuits, see Refs. [HH19, HH19b, HH19a, HH19c, SH20] and Refs. [Fla+20, Lam21, Sag+21a, Nag+21, Shi+22], with the connection to superconducting qubits established in Refs. [Lam17, Cár+18]. Another approach, which we did not review in detail, is given by combining RL with the paradigm of quantum annealing [Neu+17, AHF20, Neu+20, Mül+21, FH23, NY23]. Strategies have been developed to address the classical and quantum version of contextual bandits [LHT22, LJW22, BLT23, BKS23]. Furthermore, a quantum version of the classical RL benchmark environment CartPole has been formulated [WAU20, Mei+23]. Similarly, various interpretations of QRL for specialized tasks in the quantum domain exist [Alv+16, Alv+18, Bha+19, Alb+18, Alb+20, She+20, Oli+20, Liu+22, ÇY23]. Different approaches have been proposed for combining RL with quantum walks [Che+19, Dal+22, MVB22]. Further work on optimization tasks rather than RL, such as Refs. [Ram17, Jaš+19, Bel+20], has not been reviewed in detail. An interesting interpretation of self-learning physical machines is discussed in Ref. [LM23], which potentially could be brought into line with QRL.

4.1 Quantum-Inspired Reinforcement Learning based on Amplitude Amplification

Citation	First Author	Title
[Don+08a]	D. Dong	Quantum reinforcement learning
[Don+06]	D. Dong	Quantum mechanics helps in learning for more intelligent robots
[CDC06]	C.-L. Chen	Quantum computation for action selection using reinforcement learning
[Don+06a]	D. Dong	Quantum Robot: Structure, Algorithms and Applications
[Che+06]	C.-L. Chen	Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning
[CD08]	C.-L. Chen	A Quantum Reinforcement Learning Method for Repeated Game Theory
[Don+08]	D. Dong	Incoherent Control of Quantum Systems With Wavefunction-Controllable Subspaces via Quantum Reinforcement Learning
[CD10]	C.-L. Chen	Complexity analysis of Quantum reinforcement learning
[Don+12]	D. Dong	Robust Quantum-Inspired Reinforcement Learning for Robot Navigation
[CFD12]	C. Chunlin	Hybrid control of uncertain quantum systems via fuzzy estimation and quantum reinforcement learning
[Fak+13]	P. Fakhari	Quantum inspired reinforcement learning in changing environment
[NGC15]	S. Nuuman	A quantum inspired reinforcement learning technique for beyond next generation wireless networks

Table 2: [Part 1] Work considered for “QiRL based on Amplitude Amplification” (Sec. 4.1)

Citation	First Author	Title
[GH19]	M. Ganger	Quantum Multiple Q-Learning
[Li+20]	J.-A. Li	Quantum reinforcement learning during human decision-making
[LAD21]	Y. Li	Intelligent Trajectory Planning in UAV-Mounted Wireless Networks: A Quantum-Inspired Reinforcement Learning Perspective
[Raj+21]	K. Rajagopal	Quantum Amplitude Amplification for Reinforcement Learning
[Nir+21]	D. Niraula	Quantum deep reinforcement learning for clinical decision support in oncology: application to adaptive radiotherapy
[Wei+21]	Q. Wei	Deep Reinforcement Learning With Quantum-Inspired Experience Replay
[Yin+21]	L. Yin	Quantum deep reinforcement learning for rotor side converter control of double-fed induction generator-based wind turbines
[Hu+21]	Y. Hu	Quantum-enhanced reinforcement learning for control: a preliminary study
[Ren+22]	Y. Ren	NFT-Based Intelligence Networking for Connected and Autonomous Vehicles: A Quantum Reinforcement Learning Approach
[Cho+23]	B. Cho	Quantum bandit with amplitude amplification exploration in an adversarial environment

Table 3: [Part 2] Work considered for “QiRL based on Amplitude Amplification” (Sec. 4.1)

Quantum reinforcement learning, Dong et al. (2008) and related work

Summary. Ref. [Don+08a] discusses a new RL algorithm that is inspired by the superposition principle of quantum mechanics. The authors propose an algorithm that modifies the action-selection procedure and balances exploration and exploitation in a novel way. The authors present their ideas in modified form in a sequence of papers, see Refs. [Don+08a, Don+06, Don+06a, Che+06, CD08, CDC06, Don+08, CD10, Don+12, CFD12, Fak+13, NGC15, GH19, Li+20, LAD21, Hu+21]; for an overview see also Ref. [Raj+21]. The original work [Don+08a] discusses how to execute the proposed algorithm on actual quantum devices, which, however, did not exist at that time. As also discussed below, it is not clear how to run the algorithm in quantum superposition, and whether this is possible in practice without taking away potential quantum advantage. Despite these doubts, the proposed concepts enhance classical RL with ideas from QC, which leads us to view this approach as QiRL.

Algorithmic Concepts and Extensions. Initially, the algorithm is formulated as merely quantum inspired in Ref. [CD08] (i.e., it is developed for a classical computer that simulates a quantum superposition). The motivation is to design an algorithm with a better exploration-exploitation trade-off compared to, e.g., $\epsilon$-greedy action selection. The underlying routine is a modification of temporal difference (TD) learning, more concretely TD(0), in the following way: For each state, the set of possible actions is in a ‘superposition’, and the agent (in state $s$) selects an action with a given probability. The action is taken, and the new state $s^{\prime}$ and reward $r$ are observed. Afterwards, the probability of the taken action is increased by $k(r+V(s^{\prime}))$, where $V(s^{\prime})$ is the value function of state $s^{\prime}$ and $k$ is a hyperparameter. The term $r+V(s^{\prime})$ samples a quantity similar to $Q(s,a)$. Consequently, the update creates a probability distribution in which, for a given state, the probability to select an action increases with the value of $Q(s,a)$. Therefore, this action selection process corresponds to sampling from a stochastic policy dependent on the value of the state-action pairs.
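The classical, quantum-inspired action selection described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the authors' implementation: the renormalization after the probability increase, the clipping at a small positive value, the discount factor in the TD(0) update, and the toy problem are all assumptions, and the names `select_action`, `reinforce_action`, and `td0_update` are hypothetical.

```python
import random


def select_action(probs, rng):
    """Sample an action index according to the current probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_action(probs, action, r, v_next, k):
    """Raise the probability of the taken action by k * (r + V(s')).

    Clipping at a small positive value and renormalizing afterwards
    are assumptions that keep the result a valid distribution.
    """
    updated = list(probs)
    updated[action] = max(updated[action] + k * (r + v_next), 1e-12)
    total = sum(updated)
    return [p / total for p in updated]


def td0_update(values, s, r, s_next, alpha, gamma):
    """Standard TD(0) update of the state-value estimate V(s)."""
    values[s] += alpha * (r + gamma * values[s_next] - values[s])


# Assumed toy problem: decision state 0, terminal state 1,
# and only action 1 yields a reward of 1.
rng = random.Random(0)
values = [0.0, 0.0]
probs = [0.5, 0.5]
for _ in range(200):
    a = select_action(probs, rng)
    r = 1.0 if a == 1 else 0.0
    probs = reinforce_action(probs, a, r, values[1], k=0.1)
    td0_update(values, 0, r, 1, alpha=0.1, gamma=1.0)
```

In this toy run the probability of the rewarded action can only grow, so exploration decays gradually instead of being controlled by a fixed $\epsilon$.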

Now the algorithm is translated to run on a quantum computer. The stochastic policy is replaced by a quantum superposition. That is, for each state $s$ the possible actions are represented by the eigenstates of some observable and a superposition of these states is created. If the observable is measured, the state collapses to an eigenstate associated with an action, which is then taken by the agent; this constitutes the selection process. After receiving the reward and the new state, the Grover operator is applied $L=\mathrm{min}\{k(r+V(s^{\prime})),L_{\mathrm{max}}\}$ times to a copy of the superposition state to enhance the amplitude corresponding to the previously selected action. The variable $L_{\mathrm{max}}$ guarantees that the Grover operator is not applied too many times. Note that repeated application of the procedure requires a new copy of the state after each measurement. Due to the no-cloning theorem, this could be realized by many different independent copies of the initial memory, or by a purely classical representation of the states. The latter realization reduces the algorithm to the initial proposal of a quantum-inspired action selection process.
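The amplitude update can be simulated classically on a list of amplitudes, which corresponds to the purely classical representation mentioned above. The sketch below assumes a textbook Grover iteration (phase flip of the selected action followed by inversion about the mean) and rounds $k(r+V(s^{\prime}))$ down to an integer number of iterations; both choices, as well as the function names, are assumptions for illustration.

```python
def grover_iteration(amps, marked):
    """One Grover iteration on real or complex amplitudes.

    Step 1: flip the sign of the marked amplitude (oracle).
    Step 2: reflect all amplitudes about their mean (diffusion).
    """
    flipped = list(amps)
    flipped[marked] = -flipped[marked]
    mean = sum(flipped) / len(flipped)
    return [2 * mean - a for a in flipped]


def amplify_action(amps, action, r, v_next, k, l_max):
    """Apply L = min(int(k * (r + V(s'))), L_max) Grover iterations."""
    n_iter = max(0, min(int(k * (r + v_next)), l_max))
    for _ in range(n_iter):
        amps = grover_iteration(amps, action)
    return amps


# Four actions in uniform superposition; action 2 was taken and rewarded.
uniform = [0.5, 0.5, 0.5, 0.5]
once = amplify_action(uniform, action=2, r=1.0, v_next=0.0, k=1.0, l_max=3)
twice = grover_iteration(once, 2)

probs_once = [abs(a) ** 2 for a in once]
probs_twice = [abs(a) ** 2 for a in twice]
```

With four actions, a single iteration already moves all probability to the selected action, while a second iteration reduces it to $1/4$ again. This over-rotation illustrates why the number of iterations is capped by $L_{\mathrm{max}}$.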

In Ref. [Don+12] the QiRL algorithm is applied to robot navigation. It is stated explicitly that QiRL is a classical action-selection method that differs from the ideas of QRL, which in principle could benefit from a quantum computer. In Ref. [GH19] the algorithms are generalized to $Q$-learning as well as double and multiple $Q$-learning. These approaches should also be understood in the context of QiRL. Finally, Refs. [Li+20, Nir+21] apply QiRL to human decision-making behavior, Ref. [Yin+21] to a complex control task, and Ref. [Ren+22] to autonomous vehicles. Recently, the quantum-inspired approach to action selection in RL was transferred to experience replay buffer sampling in $Q$-learning [Wei+21].

Remarks. Although it is mentioned in Refs. [Don+08a, Don+06, CD08, CDC06] that the whole algorithm could be run in quantum superposition on a quantum device, no details of such a genuine QRL algorithm are given. Overall, it is unclear whether such an algorithm exists. Indeed, subsequent work focuses on the QiRL paradigm.

The claims made in Refs. [Don+08a, Don+06, Don+06a, CD08, CDC06, Don+08, Don+12, GH19, Li+20, LAD21, Cho+23] can be summarized as follows: speed-up in learning by better balancing exploration and exploitation; less GPU power needed on a classical computer compared to algorithms like classical $Q$-learning; more robustness against changes of the learning rate. More experiments on larger environments, providing deeper insights into the scaling of the algorithm, and a rigorous complexity analysis would be interesting topics for future work.

4.2 Quantum Reinforcement Learning with Variational Quantum Circuits

This section summarizes the state of the art in VQC-based RL. Several ideas have been proposed in this field, with extensions in different directions. Their common ground is the usage of a VQC as a parameterized function approximator.

The typical hybrid pipeline is summarized in Fig. 6. It was originally proposed for $Q$ -function approximation by Chen et al. [Che+20] and extended to policy approximation by Jerbi et al. [Jer+21]. Other work proposes several modifications to this pipeline, which we will describe in the respective summaries. The algorithm must be understood as hybrid, as a lot of the work, especially the optimization, is executed on classical hardware. The agent observes the current state of the environment $s_{t}$ , and applies some pre-processing $\phi$ . The result is encoded using the feature map $U_{\phi(s)}$ . With the current variational parameters $\theta_{t}$ , a quantum state is prepared and a (potentially action-dependent) observable $O_{a}$ is measured. The expectation value $\expectationvalue{O_{a}}_{s,\theta}$ can be post-processed to represent, e.g., a state-action value function $Q_{\theta}(s,a)$ , or the policy $\pi_{\theta}(a|s)$ . Depending on the instance, the agent employs this function to sample an action $a_{t}$ and executes it in the environment. The reward $r_{t}$ (and potentially also the consecutive state $s_{t+1}$ ) is observed by the classical optimizer. To enable gradient-based parameter updates, an additional hybrid module uses the parameter-shift rule [Cro19, Wie+22a] to compute the gradients of the VQC outputs w.r.t. the variational parameters $\theta_{t}$ . The classical optimizer determines the new parameter set $\theta_{t+1}$ and instantiates the VQC with these updated parameters. This overall iterative procedure of environment interaction, function approximation, and parameter update is repeated for several episodes, in the same way as for, e.g., DRL.

Unfortunately, thus far there is no guaranteed quantum advantage for this approach, apart from some cryptography-inspired artificial datasets [Jer+21, SJD22]. However, several of the papers and preprints summarized in this section demonstrate promising experimental results.

4.2.1 Value-Function Approximation

This section covers VQC-based approximations in value space, as described for the instance of classical $Q$-learning in Eqs. 11 and 12. The work by Chen et al. [Che+20] was indeed the first proposal of this type of approximation-based technique, which was reproduced and extended in Refs. [Lok+22, CCC23, Che23b, FP+23]. A modification of the state encoding procedure has been discussed by Lockwood and Si [LS20], and was up-scaled in another work by the same authors [LS21]. A slight reformulation of the technique, which comes with a provable advantage for very specific scenarios, can be found in Skolik et al. [SJD22]. An analysis of noise influence for this framework is discussed in Ref. [Sko+23]. An extension to environments with continuous action spaces is proposed in Ref. [LXJ23]. Ideas based on amplitude amplification to efficiently evaluate the approximated $Q$-function have been introduced in Ref. [San+23], which however cannot be realized given the current hardware restrictions.

Citation	First Author	Title
[Che+20]	S. Y.-C. Chen	Variational Quantum Circuits for Deep Reinforcement Learning
[Lok+22]	S. Lokes	Implementation of Quantum Deep Reinforcement Learning Using Variational Quantum Circuits
[Che23b]	S. Y.-C. Chen	Quantum deep Q learning with distributed prioritized experience replay
[CCC23]	H.-Y. Chen	Deep-Q Learning with Hybrid Quantum Neural Network on Solving Maze Problems
[FP+23]	G. Fikadu Tilaye	Investigating the Effects of Hyperparameters in Quantum-Enhanced Deep Reinforcement Learning
[LS20]	O. Lockwood	Reinforcement Learning with Quantum Variational Circuits
[LS21]	O. Lockwood	Playing Atari with Hybrid Quantum-Classical Reinforcement Learning
[SJD22]	A. Skolik	Quantum agents in the Gym: a variational quantum algorithm for deep $Q$ -learning
[Sko+23]	A. Skolik	Robustness of quantum reinforcement learning under hardware errors
[LXJ23]	Y. Liu	Reinforcement Learning for Continuous Control: A Quantum Normalized Advantage Function Approach

Table 4: Work considered for “QRL with VQCs – Value-Function Approximation” (Sec. 4.2.1)

Variational Quantum Circuits for Deep Reinforcement Learning, Chen et al. (2020) and related work

Summary. This paper by Chen et al. [Che+20] represents the first attempt to utilize VQCs for RL. This is done in the context of using VQCs as function approximators for the state-action value function. The authors perform simulations on simple benchmark environments and report a performance at least on par with NN-based agents at a reduced parameter count.

Hybrid Algorithm. The algorithm is inspired by deep $Q$-learning (DQL) [Mni+15], where a DNN represents the $Q$-function. The authors replace the DNN by a VQC. The update is performed w.r.t. the mean square error (MSE) loss function $\mathcal{L}(\theta)=\mathbb{E}\left[\left(r_{t}+\gamma\cdot\mathrm{max}_{a^{\prime}}Q_{\theta^{\prime}}(s_{t+1},a^{\prime})-Q_{\theta}(s_{t},a_{t})\right)^{2}\right]$ using, e.g., gradient descent. Additionally, experience replay and target networks (second set of parameters $\theta^{\prime}$) are employed to address the instabilities stemming from bootstrapping the value function, forming a double deep $Q$-learning (DDQL) algorithm. Fig. 7 gives the complete algorithm.
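The MSE loss above can be evaluated on a batch of transitions as sketched below. The two $Q$-functions are passed as callables returning one value per action; in this sketch they are hypothetical lookup tables standing in for the VQC with parameters $\theta$ and the target copy with parameters $\theta^{\prime}$. The batch, the discount factor, and all names are assumptions for illustration.

```python
def dql_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over transitions (s, a, r, s_next).

    q_online(s) and q_target(s) return a list with one Q-value per action.
    """
    squared_errors = []
    for s, a, r, s_next in batch:
        # Bootstrapped target uses the separate target parameters theta'.
        target = r + gamma * max(q_target(s_next))
        squared_errors.append((target - q_online(s)[a]) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical lookup tables in place of the VQC-based Q-functions.
online_table = {0: [0.0, 1.0], 1: [0.5, 0.0]}
target_table = {0: [0.0, 2.0], 1: [1.0, 0.0]}

batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]  # (state, action, reward, next state)
loss_value = dql_loss(batch, online_table.__getitem__,
                      target_table.__getitem__, gamma=0.9)
```

In the hybrid setting, the values returned by `q_online` would be post-processed expectation values of the VQC, and the gradient of this loss w.r.t. $\theta$ would be obtained via the parameter-shift rule.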

VQC Architecture. The feature map uses simple computational basis encoding on individual qubits. More concretely, the RL state is interpreted as a bitstring, which can be encoded using the identity $R_{z}(\pi)R_{x}(\pi)\ket{0}=\ket{1}$. The entanglement structure connects nearest neighbors with $CZ$ gates. The variational parameters are incorporated in single-qubit rotations about the $x$, $y$, and $z$ axes. The state-action value is decoded by measuring Pauli-$Z$ observables on a number of qubits that corresponds to the number of actions in the environment. The full VQC is visualized in Fig. 8.

Experimental Results and Discussion. The proposed VQC-DQL algorithm is simulated for two environments. The first one is FrozenLake, with $16$ states and $4$ actions. The second one is CognitiveRadio, which is adapted to VQC sizes of $2$ to $5$ qubits. The authors report that their VQC-based agent performs at least as well as a NN. Moreover, they claim that this requires fewer parameters (about one order of magnitude fewer than DNNs), which points towards potential quantum advantage. The model is tested on actual quantum hardware with competitive results.

Remarks. The employed encoding scheme (computational basis encoding) could be simplified by omitting the $R_{z}$ rotations, as these only introduce a global phase. The CognitiveRadio environment might be oversimplified. We also note that the claim on reduced parameter count should be substantiated by experiments with environments of different scale.
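The encoding identity and the remark on the $R_{z}$ rotation can be checked numerically with a short single-qubit simulation. The sketch assumes the rotation convention $R_{P}(\alpha)=\exp(-i\alpha P/2)$; the function `encode_bit` and its `with_rz` flag are hypothetical names introduced for illustration.

```python
import cmath
import math


def rx(angle):
    """Matrix of R_x(angle) = exp(-i * angle * X / 2)."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def rz(angle):
    """Matrix of R_z(angle) = exp(-i * angle * Z / 2)."""
    return [[cmath.exp(-1j * angle / 2), 0j],
            [0j, cmath.exp(1j * angle / 2)]]


def apply(gate, state):
    """Apply a 2x2 gate to a single-qubit state vector."""
    return [
        gate[0][0] * state[0] + gate[0][1] * state[1],
        gate[1][0] * state[0] + gate[1][1] * state[1],
    ]


def encode_bit(bit, with_rz=True):
    """Encode a classical bit by rotating |0> with angle pi * bit."""
    angle = math.pi * bit
    state = apply(rx(angle), [1 + 0j, 0j])
    if with_rz:
        state = apply(rz(angle), state)
    return state


full = encode_bit(1)                    # R_z(pi) R_x(pi) |0>
reduced = encode_bit(1, with_rz=False)  # R_x(pi) |0> only
probs_full = [abs(a) ** 2 for a in full]
probs_reduced = [abs(a) ** 2 for a in reduced]
```

Under this convention, `full` equals $\ket{1}$ while `reduced` equals $-i\ket{1}$. Both give identical measurement probabilities, since they differ only by a global phase.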

Reproduction. A reproduction study by Lokes et al. [Lok+22] conducts an extended hyperparameter search for the described setup. The results and claims are overall consistent with [Che+20], but no novel findings were reported.

Extension. In the work by S. Y.-C. Chen [Che23b] the quantum $Q$ -learning framework introduced in [Che+20] is extended by incorporating prioritized experience replay. Additionally, an asynchronous training routine is employed, similar to the one discussed in [Che23]. Both techniques reduce the overall sampling complexity and therefore allow for solving more complex tasks with the same underlying quantum model. This is validated with numerical simulations on several versions of the CartPole environment.

Hybrid Model. The work by Chen et al. [CCC23] extends the quantum models used in [Che+20] with classical neural networks, to produce more expressive function approximators. With that extension, the quantum agent is able to solve a $20\times 20$ gridworld maze, which should clearly be more complex than the originally considered FrozenLake environment. However, with the provided analysis it is unclear to what extent the performance can be attributed to the quantum part of the model.

Hyperparameter Analysis. A hyperparameter analysis is conducted by Fikadu Tilaye and Pandey [FP+23], with a focus on the $Q$-learning framework introduced in [Che+20]. The authors conclude that deeper quantum circuits lead to a better overall performance, while a larger learning rate speeds up the overall process. However, the analysis is superficial and quite small-scale, so further investigations are necessary to allow for more general statements.

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
FrozenLake (OpenAI Gym)	DDQL	$Q$-function	discrete, $16$	discrete, $4$	$4$	$4\times 2$ (encoding); $4\times 4\times 3$ (weights)
CognitiveRadio (see [Che+20])	DDQL	$Q$-function	discrete, $n^{2}$	discrete, $n$	$n$	$n\times 2$ (encoding); $n\times 4\times 3$ (weights)

¹ encoding gates: $qubits\times per\_qubit$; variational gates: $qubits\times layers\times per\_qubit\_per\_layer$

Table 5: Algorithmic Characteristics - Chen et al. [Che+20]

Reinforcement Learning with Quantum Variational Circuits, Lockwood and Si (2020)

Summary. The work by Lockwood and Si [LS20] modifies several aspects of the routine proposed by Chen et al. [Che+20]. Most importantly, they introduce two new encoding schemes to deal with a continuous state space.

Modification of Architecture. The first proposed encoding is denoted as scaled encoding. It scales the RL state values to the range $[0,2\pi)$, which are then encoded using some $1$-qubit parameterized rotations. The second one (so-called directional encoding) only encodes the sign of the value. More concretely, if a state variable is positive, $R_{x}$ and $R_{z}$ rotations by $\pi$ are applied to the encoding qubit (following a similar idea as the computational basis encoding [Che+20]).
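The two encodings can be sketched as small Python helpers that return rotation angles. The linear min-max scaling with assumed bounds `low` and `high`, the clamping that keeps the angle strictly below $2\pi$, and the function names are assumptions for illustration; the exact scaling used in [LS20] may differ.

```python
import math


def scaled_encoding(value, low, high):
    """Map a value from [low, high] linearly to an angle in [0, 2*pi).

    Values outside the assumed bounds are clamped.
    """
    fraction = (value - low) / (high - low)
    fraction = min(max(fraction, 0.0), 1.0 - 1e-9)
    return 2.0 * math.pi * fraction


def directional_encoding(value):
    """Return the (R_x, R_z) angles: pi for positive values, else 0."""
    angle = math.pi if value > 0 else 0.0
    return angle, angle


# Assumed example observation with four continuous state variables.
observation = [0.0, -0.3, 0.01, 100.0]
scaled = [scaled_encoding(x, -1.0, 1.0) for x in observation]
directional = [directional_encoding(x) for x in observation]
```

In this sketch, the values $0.01$ and $100$ receive identical directional angles, so only the sign survives the directional encoding.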

The architecture for the variational layer consists of an entangling block (nearest-neighbor $CX$ gates) and parameterized $1$-qubit rotations about the $x$, $y$, and $z$ axes. This block is repeated three times. For decoding the state-action value, the authors employ two different strategies. The first one feeds the measurement result into a classical fully-connected layer where the number of outputs corresponds to the number of possible actions. In the other case, a so-called quantum pooling operation condenses the information of the quantum state into a subset of the qubits [CCL19]. This allows for a more flexible architecture, independent of the number of actions in the environment.

Experimental Results. The proposed algorithm and the encoding schemes are benchmarked on the CartPole and Blackjack environments. While the former uses a combination of scaled and directional encoding, the latter only employs scaled encoding. Their findings agree with those reported previously in the literature, namely that VQC-based models achieve similar performance to NN-based function approximators. As also stated by Chen et al. [Che+20], the usage of VQCs reduces the required parameter complexity.

Remarks. While the scaled encoding should be a sound choice, the directional encoding could be inappropriate for most environments. Usually, not only the sign of a state variable is relevant, but also its magnitude. With this encoding, the magnitude information is lost, which should lead to a drop in performance for more complex environments. As stated previously, the reduced parameter complexity should be investigated for larger problem instances.

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
CartPole (OpenAI Gym)	DDQL	$Q$-function	continuous, $4$-dim	discrete, $2$	$4$	$4\times 2$ (encoding); $4\times 3\times 3$ (weights)
Blackjack (OpenAI Gym)	DDQL	$Q$-function	discrete, $31\times 11\times 2$	discrete, $2$	$3$	$3\times 2$ (encoding); $3\times 3\times 3$ (weights)

¹ encoding gates: $qubits\times per\_qubit$; variational gates: $qubits\times layers\times per\_qubit\_per\_layer$

Table 6: Algorithmic Characteristics - Lockwood and Si [LS20]

Playing Atari with Hybrid Quantum-Classical Reinforcement Learning, Lockwood and Si (2021)

Summary. This work by Lockwood and Si [LS21] extends their previous paper [LS20], which, in turn, was based on Chen et al. [Che+20], where $Q$ -learning with VQC function approximation was introduced. The paper considers the Atari environments Pong and Breakout, with a continuous state space of dimensionality $28224$ (the observations are cropped and converted to images with $84\times 84\times 4$ pixels). This environment complexity is not tractable with previously introduced encoding schemes, which require one qubit for each dimension. The proposed workaround uses a classical NN to reduce the state dimensionality before encoding it into the VQC.

Underlying Algorithm and Simulation. Similar to Refs. [Che+20, LS20], the concept of DDQL is used. The pipeline is modified by replacing the pure VQC function approximator with a hybrid model. Several different choices are considered; the most important details are highlighted below. The training is performed in an end-to-end manner, i.e., the gradients w.r.t. the VQC parameters are propagated back through the classical encoding network.

Model Architecture. The VQC architecture is, as usual, composed of three parts (i.e. state encoding, variational layers, and action decoding). To encode the state, the raw data is first fed through a classical NN. This outputs a number of values equal to the number of parameters in the feature map, which itself consists of $1$ -qubit parameterized rotations. The authors compare the performance of a densely connected and a convolutional neural network (CNN) for this task (the concrete architecture of these networks is not specified). Apart from that, encoding layers of different sizes (and therefore different numbers of parameters) from $5$ to $15$ qubits are compared.

The variational part itself consists of two components, where the first one is a quantum convolutional neural network (QCNN) [CCL19]. The authors state two motivations for this choice: First, it should help capture the spatial structure of the input images (but it is unclear whether the encoding part retains the spatial structure). Second, QCNNs help to avoid barren plateaus [Pes+21] (while the experiments show no sign of barren plateaus, it is not clear if this is due to this choice or to the limited size of the employed circuits). After this QCNN there are three repetitions of entanglement gates and parameterized rotations, similar to those also used for state encoding.

The paper proposes two methods to deal with the problem of measurement for an unequal number of qubits and actions. The first method performs Pauli- $Z$ measurements on all qubits and uses an appended dense NN. Alternatively, quantum pooling operations [CCL19] are used, which successively compress the information of two qubits into one.

Experimental Results and Discussion. To demonstrate the basic functionality of the model, initial experiments are conducted on the CartPole environment. The results demonstrate a similar performance to Lockwood and Si [LS20]. On the two Atari environments, the paper considers $12$ different hybrid architectures (dense vs. convolutional encoding, $5$ vs. $10$ vs. $15$ qubits, dense vs. pooling decoding), which are compared to a well-established classical architecture.

It turns out that the hybrid models are not able to learn at all. The authors attribute this to the lack of expressibility of the hybrid models, which only make use of about $10^{4}$ parameters, while the classical model uses about $10^{6}$ . It is expected that the performance improves for more expressive models, as learning on the much simpler CartPole environment was successful.

Remarks. The experiments are conducted with a restricted set of hybrid models. Consequently, the claim that these results do not demonstrate the inapplicability of QRL to more complex environments like Atari is reasonable. The assumption that this approach could be made to work on complex environments, as it succeeds on e.g. CartPole, should be substantiated with additional experiments. For a modified architecture succeeding on the Atari environments, it is not completely clear which part of the work is done by the classical and which by the quantum part of the model. This is a typical caveat whenever quantum and classical architectures are combined.

Table 7: Algorithmic Characteristics - Lockwood and Si [LS21]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates
CartPole (OpenAI Gym)	DDQL	$Q$ -function	continuous, $4$ -dim	discrete, $2$	$5$	$N/A$ (classical)¹; $\mathcal{O}\left(10^{1}\right)$ (encoding); $\mathcal{O}\left(10^{2}\right)$ (weights)
Pong-v0 (OpenAI Gym)	DDQL	$Q$ -function	continuous, $28224$ -dim²	discrete, $6$	$5$ to $15$	$\mathcal{O}\left(10^{6}\right)$ (classical); $\mathcal{O}\left(10^{2}\right)$ (encoding); $\mathcal{O}\left(10^{4}\right)$ (weights)
Breakout-v0 (OpenAI Gym)	DDQL	$Q$ -function	continuous, $28224$ -dim²	discrete, $4$	$5$ to $15$	$\mathcal{O}\left(10^{6}\right)$ (classical); $\mathcal{O}\left(10^{2}\right)$ (encoding); $\mathcal{O}\left(10^{4}\right)$ (weights)

¹ potentially also uses a classical NN for pre-processing, details are not stated
² dimensionality of feature space is reduced with a NN to fit size of feature map

Quantum agents in the Gym: a variational quantum algorithm for deep $Q$ -learning, Skolik et al. (2022)

Summary. This work by Skolik et al. [SJD22] proposes another instance of $Q$ -learning with VQCs as function approximators. Being aware of preceding literature, the authors set out to analyze the role of architecture design, RL state encoding schemes, and observables for action decoding. With regard to the previous work, the authors remark that the CartPole environment cannot be considered solved.

Importance of Architecture Design. In terms of architecture choices, the problem of barren plateaus is emphasized: Architectures with many qubits and layers (which naively are required for high expressivity) are hard to train. Conversely, under-parameterized architectures are easier to train, but probably less expressive and therefore less effective on a given task.

The authors chose a hardware-efficient ansatz, despite it being known to run into the barren plateau problem for large circuits. For the small circuit sizes considered in the present work, the barren plateau problem does not appear to be relevant.

Encoding Schemes. As for encoding schemes, discrete RL states are encoded in the computational basis. Continuous states are scaled to the finite interval $[-\pi/2,+\pi/2]$ by applying $\arctan$ to the raw observations. The result serves as the rotation angle for an $R_{x}$ rotation, which is very similar to the scaled encoding proposed by Lockwood and Si [LS20]. In order to increase expressivity w.r.t. the input, the encoding layer can be repeated throughout the circuit, forming a data re-uploading structure [Pér+20]. Effectively, this allows the model to learn and approximate a Fourier sum of a certain order, where the order is tied to the number of repetitions of the encoding layer [SSM21]. The encoding is further modified by introducing learnable re-scaling parameters that are multiplied with the raw states before computing the $\arctan$ .
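The scaled encoding described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `encode_angles`, the example observation, and the unit re-scaling weights are hypothetical and not taken from [SJD22]; only the order of operations (multiply by a learnable weight, then apply $\arctan$ ) follows the description in the text.

```python
import numpy as np

def encode_angles(state, scale):
    """Map a raw continuous state to R_x rotation angles.

    The learnable re-scaling weights are multiplied with the raw state
    before the arctan, so every angle lies in (-pi/2, pi/2).
    """
    state = np.asarray(state, dtype=float)
    scale = np.asarray(scale, dtype=float)
    return np.arctan(scale * state)

# Hypothetical 4-dimensional observation and unit re-scaling weights.
state = np.array([0.0, 1.0, -1.0, 1e6])
angles = encode_angles(state, np.ones(4))
```

Even very large raw values are squashed into the admissible angle range, which is the reason for using $\arctan$ instead of a fixed linear normalization for unbounded observations.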

Experimental Results and Discussion. The authors benchmark their architecture choices on the FrozenLake and CartPole environments. The performance on CartPole is compared to a small NN with the same number of parameters, and the NN seems to be inferior. Further, the range of $Q$ -values that can be encountered in the two benchmark environments is investigated. For FrozenLake, representing the $Q$ -value with the expectation values of $1$ -qubit $Z$ -operators is sufficient. For the CartPole environment, this strategy is found not to be adaptable enough. Instead, they choose the expectation values of the parities (of 2 non-overlapping pairs of qubits) and allow for additional trainable classical weights that set the scale for the $Q$ -value approximation.

Remarks. The authors emphasize the critical role of architectural choices at the outset of their manuscript. While they offer valuable insights into this topic, open questions remain for future work in this direction. For the CartPole environment, several trainable classical weights are incorporated in the algorithm. Therefore, it is not completely clear what part of the training is achieved by which part of the hybrid model.

Error Analysis. The work by Skolik et al. [Sko+23] analyzes the influence of hardware noise on the quantum $Q$ -learning framework introduced in [SJD22], as well as on the quantum policy gradient (QPG) approaches discussed in Sec. 4.2.2. The results are numerically validated on the CartPole environment and a version of the Travelling Salesperson Problem. The results indicate that the performance depends strongly on the inherent structure of the noise. For some instances, the robustness of the learned policy is actually increased if noise is encountered during training. However, e.g. for strong incoherent noise the performance decreases quite substantially. Especially interesting from a practical point of view is the analysis of shot noise, which indicates that a low number of repetitions is enough to get a reliable estimate of the $Q$ -function – an explicit algorithm to exploit this property is proposed in this work.

Continuous Action Spaces. A $Q$ -learning approach based on [SJD22] that incorporates continuous action spaces is discussed by Liu et al. [LXJ23]. They use normalized advantage functions, which allow for continuous action selection. An alternative would be to additionally use a policy function approximator to form an actor-critic approach, as discussed in Sec. 4.2.3.

Table 8: Algorithmic Characteristics - Skolik et al. [SJD22]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
CartPole (OpenAI Gym)	DDQL	$Q$ -function	continuous, $4$ -dim	discrete, $2$	$4$	$4\times 1$ (encoding); $4\times 15\times 2$ (weights); $N/A$ (classical)²
FrozenLake (OpenAI Gym)	DDQL	$Q$ -function	discrete, $16$	discrete, $4$	$4$	$4\times 1$ (encoding); $4\times 15\times 2$ (weights)

¹ encoding gates: $qubits\times per\_qubit$ ; variational gates: $qubits\times layers\times per\_qubit\_per\_layer$
² model incorporates classical weights after measurement, details are not stated

4.2.2 Policy Approximation

This section covers VQC-based approximations in policy space, as described for the instance of classical policy gradients in Eqs. 13 and 14. The concept was introduced by Jerbi et al. [Jer+21], shortly followed by a slight reformulation in Ref. [Kun22], and an extension to allow for faster computation in Ref. [BAQ23]. Several modifications, including formulating full-quantum interaction with a quantum control environment, have been introduced in Sequeira et al. [SSB23] – with a closer analysis of quantum-accessible environments revealing potential advantage compared to certain classical routines in Ref. [Jer+23]. Algorithmic extensions to the QPG setup were proposed in Ref. [Mey21]. Details on a classical post-processing function introduced therein to improve RL performance are discussed in Meyer et al. [Mey+23a], and quantum natural gradients to enhance trainability are covered by the same authors in [Mey+23].

Citation	First Author	Title
[Jer+21]	S. Jerbi	Parameterized Quantum Policies for Reinforcement Learning
[Kun22]	L. Kunczik	Reinforcement Learning with Hybrid Quantum Approximation in the NISQ Context
[BAQ23]	Quafu Group	Quafu-RL: The Cloud Quantum Computers based Quantum Reinforcement Learning
[SSB23]	A. Sequeira	Policy gradients using variational quantum circuits
[Jer+23]	S. Jerbi	Quantum Policy Gradient Algorithms
[Mey+23a]	N. Meyer	Quantum Policy Gradient Algorithm with Optimized Action Decoding
[Mey+23]	N. Meyer	Quantum Natural Policy Gradients: Towards Sample-Efficient Reinforcement Learning

Table 9: Work considered for “QRL with VQCs– Policy Approximation” (Sec. 4.2.2)

Parameterized Quantum Policies for Reinforcement Learning, Jerbi et al. (2021) and related work

Summary. The paper by Jerbi et al. [Jer+21] starts out with a small summary of VQC-based ML models. They cite several reports of quantum advantage in supervised and unsupervised QML. This motivates their approach to go beyond the scope of $Q$ -function approximation [Che+20, LS20, LS21, SJD22], and use the VQC to directly approximate the policy.

Quantum Policy Gradient. After a brief recap of policy gradient methods for solving RL problems, the authors extend those ideas to a QPG approach. More concretely, they quantize the REINFORCE algorithm [Wil92] with value-function baselines by using VQCs as function approximators for the (stochastic) policy. They define two families of VQC-based policies: (1) A RAW-VQC policy, where the action selection follows Born’s rule. It is defined as $\pi_{\theta}(a|s)=\expectationvalue{P_{a}}_{s,\theta}$ , where $P_{a}$ are the projectors on the elements of the computational basis. This allows action selection with only one evaluation of the quantum circuit; (2) A SOFTMAX-VQC policy, defined as $\pi_{\theta}(a|s)=e^{\beta\expectationvalue{O_{a}}_{s,\theta}}/\sum_{a^{\prime}}e^{\beta\expectationvalue{O_{a^{\prime}}}_{s,\theta}}$ . The measurement result of an action-dependent observable $O_{a}$ is fed into a single-parameter softmax function to form a PDF. The inverse-temperature parameter $\beta$ allows one to adjust the peak width of the distribution, i.e., the greediness of the policy.
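To make the two policy definitions concrete, the following sketch evaluates both from already-measured quantities. It is a classical illustration only: the function names, the example probabilities and expectation values, and the grouping of basis states into actions are assumptions made here, not details from [Jer+21].

```python
import numpy as np

def raw_policy(basis_probs, partition):
    """RAW-VQC: pi(a|s) = <P_a>, i.e. the total probability of all
    computational basis states assigned to action a."""
    basis_probs = np.asarray(basis_probs, dtype=float)
    return np.array([basis_probs[idx].sum() for idx in partition])

def softmax_policy(expectations, beta):
    """SOFTMAX-VQC: pi(a|s) proportional to exp(beta * <O_a>)."""
    logits = beta * np.asarray(expectations, dtype=float)
    logits = logits - logits.max()  # stabilizes exp, leaves the ratio unchanged
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical 2-qubit outcome distribution; basis states 0,1 -> action 0, 2,3 -> action 1.
probs_raw = raw_policy([0.1, 0.2, 0.3, 0.4], partition=[[0, 1], [2, 3]])

# Hypothetical expectation values <O_0>, <O_1>; beta = 0 gives a uniform policy.
probs_soft = softmax_policy([0.5, -0.5], beta=1.0)
probs_uniform = softmax_policy([0.5, -0.5], beta=0.0)
probs_greedy = softmax_policy([0.5, -0.5], beta=10.0)
```

Increasing $\beta$ concentrates the probability mass on the action with the largest expectation value, which is the adjustable greediness referred to in the text.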

Circuit Architecture. The ansatz for the VQC is chosen to be hardware-efficient, i.e., it uses only single- and two-qubit gates. The RL state is encoded with $1$ -qubit rotations. To increase the expressivity of the model, the authors introduce additional learnable state-scaling parameters $\lambda$ . Those are multiplied with the rotational parameter denoting the state value, i.e., $\lambda_{i}\cdot s_{i}$ is the angle of a $1$ -qubit rotation. This also helps circumvent the problem of being restricted to a finite set of frequencies in such an encoding scheme [SSM21]. The feature map is repeated several times, alternating with the variational layer, which forms a data re-uploading structure [Pér+20]. A variational layer consists of $CZ$ -gates for creating entanglement in a circular structure. The learnable parameters are used in $1$ -qubit parameterized rotation gates. Depending on the policy type, measurements are either conducted in the computational basis, or more complex observables are measured.

Experimental Results. Overall, all agents are able to learn meaningful behavior in the OpenAI Gym environments CartPole, MountainCar, and Acrobot. Further experiments are reported, which serve the purpose of assessing the importance of the various design choices: (1) Circuit depth increases performance and learning speed, where SOFTMAX-VQC policies outperform RAW-VQC policies in all instances; (2) Incorporating learnable state-scaling parameters increases learning performance, and trainable classical weights (in the case of SOFTMAX-VQC) multiplied with the expectation values lead to a further increase in performance; (3) The performance gap between RAW-VQC and SOFTMAX-VQC policies seems to stem from the ability to adjust greediness.

Provable and Empirical Quantum Advantage. To the best of our knowledge, this work is the first to corroborate the idea of quantum advantage with VQCs in the RL setting. To this end, the authors devise RL environments (based on the discrete logarithm problem (DLP)), which are supposed to be classically intractable. Any classical algorithm would need a number of samples that scales exponentially in the problem size to achieve a low generalization error. A VQC-based algorithm with a very specific architecture only requires a polynomial amount of data. This implies an exponential advantage w.r.t. sample complexity, assuming it is infeasible to efficiently simulate the VQC on classical hardware for large problem instances. The construction of the environment is inspired by previous results from QML, where similar learning separations between classical and quantum models have been demonstrated [LAT21].

Further, the authors report numerical evidence of potential quantum advantage for environments based on expectation values sampled from VQCs. The motivation lies in the (potential) intractability of simulating the given VQC classically for large systems. More concretely, one uses a VQC to define a labeling function (in the sense of a classification task) over the domain $[0,2\pi]^{2}$ (so-called SL-VQC). This synthetic classification dataset is then rephrased as an RL environment by incorporating some temporal structure (denoted as Cliffwalk-VQC). Numerically, the authors observe a performance separation between models with classical DNNs and VQC-based policies. They claim that this is likely due to the oscillatory structure in the labeling function.

Remarks. While the proposal of provable quantum advantage is obviously quite encouraging, the practical realization is probably out of reach for the NISQ-era. The idea of solving the task efficiently on quantum hardware is based on Shor’s algorithm. Formulated as a VQC-based RL problem, this would require circuits of complexity far beyond current scope. We think that more large-scale experiments are also required to support the empirical learning separation on the SL-VQC and Cliffwalk-VQC environments. A comparison to other hybrid models [Che+20, LS20] shows that the proposed QPG approach is superior in terms of RL performance on various environments.

Alternative Formulation. In the PhD thesis by L. Kunczik [Kun22] a slightly different formulation of the QPG framework is introduced, where the output of the quantum circuit is compounded with a classical weight vector. However, the underlying routine is very similar to [Jer+21]. Empirical results are reported to verify a desirable scaling of VQC-based (as opposed to NN-based) approaches. However, the experiments are too small-scale for reliable statements regarding this correlation.

Cloud Computing. The work by the BAQIS Quafu Group [BAQ23] realizes the framework introduced in Sec. 4.2.2 and executes it on the quantum devices provided via the Quafu cloud services. The results are ambiguous, as the agents trained on hardware are not really able to learn meaningful behaviour – but they are also only trained for a very limited number of timesteps, as acknowledged by the authors.

Table 10: Algorithmic Characteristics - Jerbi et al. [Jer+21]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
CartPole (OpenAI Gym)	REINFORCE	Policy	continuous, $4$ -dim	discrete, $2$	$4$	$30$
MountainCar (OpenAI Gym)	REINFORCE	Policy	continuous, $2$ -dim	discrete, $3$	$2$	$36$
Acrobot (OpenAI Gym)	REINFORCE	Policy	continuous, $6$ -dim	discrete, $3$	$6$	$72$
SL-VQC, Cliffwalk-VQC (see [Jer+21])	REINFORCE	Policy	continuous, $2$ -dim	discrete, $2$	$2$	$37$
CognitiveRadio (see [Che+20])	REINFORCE	Policy	discrete, $n^{2}$	discrete, $n$	$n$ for $n=2$ to $5$	$30$ to $75$

¹ this entails encoding, scaling, and variational parameters; the SOFTMAX-VQC also uses classical parameters

Policy gradients using variational quantum circuits, Sequeira et al. (2023) and related work

Summary. The article by Sequeira et al. [SSB23] proposes a quantum version of the REINFORCE algorithm with a VQC-based function approximator, very similar to Jerbi et al. [Jer+21]. The methods are applied to the classical environments CartPole and Acrobot but also to a simple quantum control problem. It proposes an initialization technique for the variational parameters of a VQC. Following the experimental results, a quantum advantage w.r.t. the number of required parameters and trainability of the models is claimed.

Underlying Reinforcement Learning Algorithm. As in Jerbi et al. [Jer+21], the policy is defined as $\pi_{\theta}(a|s)=e^{\beta\cdot\expectationvalue{O_{a}}_{\theta}}/\sum_{a^{\prime}}e^{\beta\cdot\expectationvalue{O_{a^{\prime}}}_{\theta}}$ , and REINFORCE updates are performed. Hereby, the expectation value $\expectationvalue{O_{a}}_{\theta}$ for action $a$ is defined as $\expectationvalue{\sigma_{z}^{a}}$ , i.e., the expectation value of the $1$ -qubit Pauli- $Z$ observable measured on the $a$ -th qubit.

VQC Architecture. The architecture follows the typical three-part structure. In the beginning, the states are encoded with $R_{x}$ rotations, with the state values normalized to the range $[-\pi,\pi)$ . Consequently, the number of qubits has to correspond to $\max\{\absolutevalue{\mathcal{A}},\absolutevalue{\mathcal{S}}\}$ . There are several parameterized layers (see Fig. 9) which incorporate variational parameters in $1$ -qubit $R_{y}$ and $R_{z}$ rotations. The entanglement structure can be described as $CX[i,(i+l)\mod n]$ , where $n$ is the number of qubits, and $l$ the index of the layer. The measurement of $1$ -qubit Pauli- $Z$ observables deviates from the procedure proposed by Jerbi et al. [Jer+21], where multi-qubit observables were used.
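The layer-dependent entanglement pattern $CX[i,(i+l)\mod n]$ can be enumerated as follows. This is a minimal sketch under the assumption that layers are indexed starting from $l=1$ and that $l$ is not a multiple of $n$ (otherwise control and target would coincide); the function name `entangling_pairs` is hypothetical.

```python
def entangling_pairs(n_qubits, layer):
    """Return the (control, target) pairs CX[i, (i + layer) mod n_qubits]."""
    if layer % n_qubits == 0:
        # Control and target qubit would be identical, which is not a valid CX.
        raise ValueError("layer index must not be a multiple of n_qubits")
    return [(i, (i + layer) % n_qubits) for i in range(n_qubits)]

# For 4 qubits: nearest-neighbor ring in layer 1, next-nearest neighbors in layer 2.
pairs_layer_1 = entangling_pairs(4, 1)
pairs_layer_2 = entangling_pairs(4, 2)
```

The connectivity therefore changes from layer to layer, so that later layers couple qubits that are further apart on the ring.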

Complexity of Gradient Estimation. The paper gives an estimate of the number of samples required to get an $\epsilon$ -approximation of the log-policy gradient. According to this consideration, for a success probability of $1-\delta$ , the number of required measurements is bounded by $c\cdot\frac{(1-\epsilon)^{2}}{\epsilon^{2}}\cdot\log(\frac{k}{\delta})$ . Hereby, $c$ is a constant depending on algorithmic hyperparameters and $k$ is the number of variational parameters. It is important to state that this refers to the number of samples / data points required to get a good approximation of the true policy gradient, but not to the explicit estimation of the gradients themselves via e.g. the parameter-shift rule.

Initialization Technique. There is some work proposing a technique for parameter initialization to avoid barren plateaus [Gra+19]. However, a technique to boost the overall performance has not yet been proposed. Inspired by classical ML, the authors aim to break symmetries between different neurons (as initialization with constant values is usually a bad choice). A typical strategy is to select values uniformly at random from $[-\pi,\pi]$ , or to draw them from a Gaussian distribution.

Inspired by the classical Glorot initialization scheme [GB10], the paper proposes to use a normal distribution $\mathcal{N}(0,\mathrm{std}^{2})$ with $\mathrm{std}=g\cdot\sqrt{2/\left(\mathrm{fan}_{in}+\mathrm{fan}_{out}\right)}$ . Here, $g$ is a constant multiplicative factor, $\mathrm{fan}_{in}$ is the number of embedded features, and $\mathrm{fan}_{out}$ is the number of computational basis measurements. This technique demonstrates some promising experimental results, but no theoretical justification is given.
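A minimal sketch of this initialization rule is given below. The function names and the example values for $\mathrm{fan}_{in}$ , $\mathrm{fan}_{out}$ , and the number of parameters are hypothetical choices for illustration; only the formula for the standard deviation is taken from the description above.

```python
import numpy as np

def glorot_std(fan_in, fan_out, gain=1.0):
    """Standard deviation std = g * sqrt(2 / (fan_in + fan_out))."""
    return gain * np.sqrt(2.0 / (fan_in + fan_out))

def quantum_glorot_init(n_params, fan_in, fan_out, gain=1.0, rng=None):
    """Draw n_params variational parameters from N(0, std^2)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=0.0, scale=glorot_std(fan_in, fan_out, gain), size=n_params)

# Hypothetical setting: 4 embedded features, 4 measured qubits, 24 variational parameters.
std = glorot_std(fan_in=4, fan_out=4)
params = quantum_glorot_init(24, fan_in=4, fan_out=4, rng=np.random.default_rng(0))
```

Compared to sampling uniformly from $[-\pi,\pi]$ , the resulting parameters are concentrated around zero, with a spread that shrinks as the number of inputs and outputs grows.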

Analysis of Fisher Information Spectrum. The paper analyzes the spectrum of the Fisher information matrix (FIM), which serves as a tool to quantify the trainability of a model. The empirical FIM is computed as $F(\theta)=\frac{1}{T}\sum_{t=1}^{T}\nabla_{\theta}\log\pi(a_{t}|s_{t},\theta)\nabla_{\theta}\log\pi(a_{t}|s_{t},\theta)^{\top}$ . A similar analysis has also been proposed for QML [Abb+21].
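The empirical FIM is an average of outer products of log-policy gradients, which can be sketched as follows. The gradient vectors are assumed to be given (e.g. from the parameter-shift rule or automatic differentiation); the function name and the toy values are hypothetical.

```python
import numpy as np

def empirical_fim(log_policy_grads):
    """Average the outer products of T log-policy gradients.

    log_policy_grads has shape (T, k): one gradient of log pi(a_t | s_t, theta)
    with respect to the k parameters per time step.
    """
    grads = np.asarray(log_policy_grads, dtype=float)
    return grads.T @ grads / grads.shape[0]

# Toy example with T = 2 time steps and k = 2 parameters.
fim = empirical_fim([[1.0, 0.0], [0.0, 2.0]])
spectrum = np.linalg.eigvalsh(fim)  # eigenvalues in ascending order
```

The resulting matrix is symmetric and positive semi-definite, so its spectrum is real and non-negative; the analysis in the text concerns the distribution of these eigenvalues.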

The results show that the spectrum of the FIM associated with the quantum model exhibits significantly larger eigenvalues on average. The NN used for comparison was optimized over several architectures, but not many details are provided in the paper. The authors conclude that the quantum models are beneficial in terms of trainability, and might be resilient to barren plateaus.

Experimental Results and Discussion of Potential Quantum Advantage. The proposed algorithm is tested on the classical benchmark environments CartPole and Acrobot. The performance is compared to the best classical NN (it is not clear what best means in this case, and to what extent this holds). The authors claim a significant advantage in terms of convergence speed.

Additional experiments are conducted with the proposed Quantum-Glorot initialization technique. In the two environments CartPole and Acrobot, this technique proves to be beneficial in terms of convergence speed and training stability.

Finally, the experiments are extended to a QuantumControl environment. The task is to learn the mapping $\ket{0}\to\ket{1}$ via the time-dependent Hamiltonian $H(t)=4J(t)\sigma_{z}+h\sigma_{x}$ . This is converted to a set of unitary gates $U(t)$ , such that $\ket{\psi_{t+1}}=U(t)\ket{\psi_{t}}$ . The reward is defined as the overlap between the prepared state and $\ket{1}$ , i.e. $r_{t}=\absolutevalue{\braket{1}{\psi_{t}}}^{2}$ . The agent has to decide between the two actions $0\ \hat{=}\ \text{no pulse}$ and $1\ \hat{=}\ \text{apply pulse}$ . The usage of a quantum environment removes the necessity of encoding classical states. Unfortunately, it is not described how $\ket{\psi_{t}}$ is incorporated in the VQC (a $1$ -qubit parameterized circuit is apparently used to solve the task). The results on this environment suggest that the agent is able to learn the optimal pulses in a low number of epochs.

Summarizing the experiments, the authors claim an advantage in convergence speed compared to classical approaches (questionable, as there should be NNs which perform much better). Additionally, there seems to be a clear advantage in terms of parameter complexity.

Remarks. The authors claim that it is possible to estimate the log-policy gradient with only a logarithmic number of samples (in the number of variational parameters). While this certainly holds for simulation, it is not clear if such a technique can be applied on quantum hardware (e.g., some kind of sparse or perturbed gradients). The introduced initialization strategy gives some good experimental results, although some additional experiments and theoretical justifications would be desirable. The formulation of the empirical FIM drops the dependency on the prior state distribution, which potentially renders the considered spectrum less representative of the model than for a generic supervised learning problem. The claim of quantum advantage w.r.t. parameter complexity and absence of barren plateaus should be supported with experiments on larger-scale environments.

Quantum-Accessible Environments. An explicit analysis of quantum-accessible environments is conducted in Jerbi et al. [Jer+23]. One instance of such an environment is considered in [SSB23], but also [Wu+23] uses a related formulation. The paper derives explicit quadratic advantages in sampling complexity, if the learned policy satisfies certain regularity conditions. We consider this to be a very important step toward identifying the actual potential of QRL. Interestingly, the stated results suggest that most of the scenarios studied in literature actually satisfy the smoothness conditions. An open problem is the identification of practically relevant problems that can be formulated in the described quantum-accessible setting.

Table 11: Algorithmic Characteristics - Sequeira et al. [SSB23]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates
CartPole (OpenAI Gym)	REINFORCE	Policy	continuous, $4$ -dim	discrete, $2$	$4$	$4$ (encoding); $24$ (weights)
Acrobot (OpenAI Gym)	REINFORCE	Policy	continuous, $6$ -dim	discrete, $3$	$6$	$6$ (encoding); $36$ (weights)
QuantumControl (see [SSB23])	REINFORCE	Policy, Environment	quantum	discrete, $2$	$N/A$	$0$ (encoding)¹; $N/A$ (weights)

¹ the RL state is a quantum state, i.e. no classical information has to be encoded

Quantum Policy Gradient Algorithm with Optimized Action Decoding, Meyer et al. (2023)

Summary. The work by Meyer et al. [Mey+23a] builds upon the QPG framework introduced in [Jer+21]. It takes a closer look at the introduced RAW-VQC policy and – based on measurements in the computational basis – introduces a classical post-processing function for action selection. By optimizing this function w.r.t. a novel quality measure, significant performance improvements can be made. The introduced procedure is also suited for problems with large action spaces. Experiments on a $5$ -qubit quantum device represent the first successful training of a VQC-based RL routine on actual quantum hardware.

Classical Post-Processing. The work focuses on the RAW-VQC policy, i.e. $\pi_{\boldsymbol{\theta}}(a|\boldsymbol{s})=\expectationvalue{P_{a}}_{\boldsymbol{s},\boldsymbol{\theta}}$ . For measurements in the computational basis, this can be viewed as a partitioning of all possible bitstrings $\mathcal{C}$ . This allows the definition of a classical post-processing function $f_{\mathcal{C}}:\{0,1\}^{n}\to\{0,1,\cdots,\absolutevalue{\mathcal{A}}-1\}$ , such that $f_{\mathcal{C}}(\boldsymbol{b})=a$ , iff $\boldsymbol{b}\in\mathcal{C}_{a}$ . The policy can therefore be expressed as $\pi_{\boldsymbol{\theta}}(a|\boldsymbol{s})\approx\frac{1}{K}\cdot\sum_{k=0}^{K-1}\delta_{f_{\mathcal{C}}(\boldsymbol{b}^{(k)})=a}$ where $\boldsymbol{b}^{(k)}$ is the bitstring observed in the $k$ -th shot.
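The shot-based estimate of the policy can be sketched as follows. The measured bitstrings and the simple partitioning by the first bit are hypothetical examples chosen here; they only illustrate how a post-processing function $f_{\mathcal{C}}$ turns $K$ measurement outcomes into action probabilities.

```python
import numpy as np

def estimate_policy(bitstrings, post_process, n_actions):
    """Estimate pi(a|s) as the fraction of shots whose bitstring maps to action a."""
    counts = np.zeros(n_actions)
    for bits in bitstrings:
        counts[post_process(bits)] += 1
    return counts / len(bitstrings)

def first_bit(bits):
    """Hypothetical partitioning: the action is given by the first measured bit."""
    return bits[0]

# Hypothetical outcomes of K = 4 shots on a 2-qubit circuit.
shots = [(0, 0), (0, 1), (1, 0), (0, 1)]
policy = estimate_policy(shots, first_bit, n_actions=2)
```

For action selection alone a single shot suffices, since applying $f_{\mathcal{C}}$ to one sampled bitstring already yields an action distributed according to the policy.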

Globality Measure. The formulation in terms of a classical post-processing function allows for the definition of a quality measure on the explicitly used partitioning of $\mathcal{C}$ . The authors start out with the extracted information $\text{EI}_{f_{\mathcal{C}}}(\boldsymbol{b})$ , which denotes the number of bits necessary to get an unambiguous assignment of the bitstring $\boldsymbol{b}$ to the set $\mathcal{C}_{a}$ it is contained in. This is extended to a globality measure by averaging over all possible bitstrings, i.e. $G_{f_{\mathcal{C}}}:=\frac{1}{2^{n}}\sum_{\boldsymbol{b}\in\{0,1\}^{n}}\text{EI}_{f_{\mathcal{C}}}(\boldsymbol{b})$ . This measure quantifies how much information is used on average to make a decision for an action. While this measure is hard to compute in general, the authors discuss an explicit construction of a post-processing function that guarantees saturating the globality measure (which is trivially upper-bounded by the number of involved qubits). Based on that construction, an optimal post-processing function is given by $f_{\mathcal{C}}(\boldsymbol{b})=\left[b_{0}\cdots b_{m-1}\left(\bigoplus_{i=m}^{n-1}b_{i}\right)\right]_{10}$ , where $\left[\cdot\right]_{10}$ refers to the decimal representation and $m=\log_{2}(\absolutevalue{\mathcal{A}})-1$ .
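The closed-form post-processing function stated above can be written out directly. The sketch assumes that the number of actions is a power of two and that the bitstring is given as a sequence of 0/1 integers $b_{0},\dots,b_{n-1}$ ; the function name is hypothetical.

```python
from functools import reduce
from itertools import product
from operator import xor

def global_post_process(bits, n_actions):
    """Map a bitstring to an action via [b_0 ... b_{m-1} (XOR of b_m ... b_{n-1})]_10."""
    if n_actions < 2 or n_actions & (n_actions - 1):
        raise ValueError("n_actions must be a power of two (assumption of this sketch)")
    m = (n_actions.bit_length() - 1) - 1  # m = log2(|A|) - 1
    digits = list(bits[:m]) + [reduce(xor, bits[m:], 0)]
    action = 0
    for digit in digits:  # binary digits to decimal value
        action = 2 * action + digit
    return action

# With 2 actions the action is the parity of all bits; with 4 actions the
# first bit is kept and the parity of the remaining bits is appended.
a_parity = global_post_process((1, 0, 1, 1), n_actions=2)
a_four = global_post_process((1, 0, 1, 1), n_actions=4)

# Number of 4-bit strings assigned to each of the 4 actions.
counts = [0, 0, 0, 0]
for b in product((0, 1), repeat=4):
    counts[global_post_process(b, 4)] += 1
```

Because the last binary digit is a parity over all remaining bits, flipping any single one of them changes the selected action, which is consistent with every measured bit contributing to the decision.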

Experimental Results. The claim that a high value of the globality measure correlates with a good RL performance is experimentally demonstrated on several environments. Experiments on the CartPole benchmark with globality values ranging from $G_{f_{\mathcal{C}}}=1.0$ to the maximum possible $G_{f_{\mathcal{C}}}=4.0$ show a clear correlation between the measure and the actual performance of the resulting algorithm. It is noted that the construction of the post-processing function is explicitly detached from the complexity of the actual quantum model, and therefore is a very efficient way to improve the performance. The QRL agents with $G_{f_{\mathcal{C}}}>3.0$ also outperform the SOFTMAX-VQC policy, which was originally conjectured to be superior in [Jer+21]. These results are strengthened by experiments on FrozenLake and ContextualBandits environments. Empirical results regarding effective dimension and the Fisher information spectrum [Abb+21] also demonstrate an improved expressivity and trainability of models with a high globality measure.

Training on Quantum Hardware. Using this enhanced QPG algorithm, the authors execute a full training routine on an $8$ -state and $2$ -action ContextualBandits environment on quantum hardware. They employ a $3$ -qubit sub-topology of the $5$ -qubit IBM quantum device ibmq_manila [IBM23]. The results confirm that training VQC-based QRL algorithms on actual hardware is indeed possible. However, there is still a deterioration of performance compared to the noise-free simulation, which is explained by the currently inevitable hardware noise. Verification of the learned parameters demonstrates that the agent actually identifies the optimal action in all cases; only the certainty of that decision is less pronounced compared to simulation.

Remarks. The described action decoding procedure is easy to extend to problems with large action spaces. However, some additional engineering is necessary to account for action spaces whose size cannot be expressed as a power of two. It is left open at which point the benefit of using a post-processing function with high globality is outweighed by the likely occurrence of barren plateaus [Cer+21]. Potentially, the flexible definition of the post-processing function can be used to balance those two objectives. While the demonstration of trainability on quantum hardware is certainly small-scale, it can be considered an important step towards the practical usability of this type of algorithm.

Table 12: Algorithmic Characteristics - Meyer et al. [Mey+23a]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
CartPole (OpenAI Gym)	REINFORCE	Policy	continuous, $4$-dim	discrete, $2$	$4$	$24$ to $40$²
FrozenLake (OpenAI Gym)	REINFORCE	Policy	discrete, $16$	discrete, $4$	$4$	$24$ to $40$
ContextualBandits (see [SB18])	REINFORCE	Policy	discrete, $32$	discrete, $8$	$5$	$70$
ContextualBandits (see [SB18])³	REINFORCE	Policy	discrete, $8$	discrete, $2$	$3$	$30$

¹ this entails encoding, scaling, and variational parameters
² the SOFTMAX-VQC also uses additional classical parameters
³ hardware experiment: modified circuit structure to reduce transpilation overhead, details in [Mey+23a]

Quantum Natural Policy Gradients: Towards Sample-Efficient Reinforcement Learning, Meyer et al. (2023)

Summary. The paper by Meyer et al. [Mey+23] proposes an enhanced training routine for the framework proposed in [Jer+21] and extended in [Mey+23a]. A second-order extension – based on so-called quantum natural gradients – is employed to define the quantum natural policy gradient (QNPG) algorithm. The modified technique is experimentally demonstrated to have preferable properties regarding trainability, and is also verified on actual quantum hardware.

Natural Gradients. The original QPG algorithm is trained with first-order updates, i.e. $\Delta\boldsymbol{\theta}=\alpha\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})$. This update structure has the shortcoming that it is closely tied to the Euclidean geometry and does not take into account the actual curvature of the loss landscape. This can be mitigated by using the FIM $F(\boldsymbol{\theta})$, which describes the local curvature of the parameter space around a given point. It can be used to define a natural gradient update as $\Delta\boldsymbol{\theta}=\alpha F^{-1}(\boldsymbol{\theta})\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})$ [Ama98].
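
The effect of rescaling with the inverse FIM can be illustrated with a purely classical toy example. The sketch below assumes a two-armed bandit with a one-parameter sigmoid policy, and it is written as ascent on the expected reward; both the setting and the sign convention are assumptions chosen for illustration and are not taken from [Mey+23].

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vanilla_and_natural_gradient(theta, reward_0, reward_1):
    """Gradients of J(theta) = p * reward_1 + (1 - p) * reward_0, p = sigmoid(theta).

    For a Bernoulli policy parameterized by its logit, the Fisher information
    is p * (1 - p), so the natural gradient is the vanilla gradient divided by
    this quantity.
    """
    p = sigmoid(theta)
    grad = (reward_1 - reward_0) * p * (1.0 - p)
    fisher = p * (1.0 - p)
    return grad, grad / fisher


# In a saturated region (p close to 0) the vanilla gradient almost vanishes,
# while the natural gradient stays at reward_1 - reward_0.
vanilla, natural = vanilla_and_natural_gradient(theta=-4.0, reward_0=0.0, reward_1=1.0)
```

In this toy setting the natural gradient is independent of $\theta$, which is one way to see that the update no longer depends on how the policy happens to be parameterized.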

Quantum Natural Policy Gradients. In order to employ this concept for training in the quantum realm, the paper employs a generalization of the classical FIM. This quantum FIM $g(\boldsymbol{\theta})$ (derived from the Fubini-Study metric tensor [Che10]) is hard to compute in general. However, a block-diagonal approximation can be estimated efficiently in hardware [Sto+20]. Based on that, the paper defines the QNPG update rule as $\Delta\boldsymbol{\theta}=\alpha g^{\dagger}(\boldsymbol{\theta})\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta})$. Additionally, a regularized version of the QNPG algorithm is introduced to counter instabilities encountered when inverting the quantum FIM. It has to be highlighted that the overhead of incorporating this second-order update rule is almost negligible compared to the computation of first-order gradients, which is necessary anyway. The pipeline of the overall algorithm is visualized in Fig. 10.
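
As a rough illustration of such an update, the following sketch computes the full Fubini-Study metric of a single-qubit state by finite differences in a classical simulation and applies a regularized inverse. Several choices are assumptions made here: the ansatz $R_{z}(\theta_{2})R_{y}(\theta_{1})\ket{0}$, the finite-difference evaluation (instead of the block-diagonal hardware estimate of [Sto+20]), and the regularization of the form $g+\epsilon I$.

```python
import numpy as np


def state(params):
    """|psi> = RZ(theta_2) RY(theta_1) |0> (hypothetical single-qubit ansatz)."""
    t1, t2 = params
    return np.array([np.exp(-0.5j * t2) * np.cos(t1 / 2),
                     np.exp(0.5j * t2) * np.sin(t1 / 2)])


def fubini_study_metric(params, eps=1e-6):
    """g_ij = Re[<d_i psi|d_j psi> - <d_i psi|psi><psi|d_j psi>]."""
    params = np.asarray(params, dtype=float)
    psi = state(params)
    derivatives = []
    for i in range(len(params)):
        shift = np.zeros_like(params)
        shift[i] = eps
        # Central finite difference of the state vector.
        derivatives.append((state(params + shift) - state(params - shift)) / (2 * eps))
    n = len(params)
    metric = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            term = (np.vdot(derivatives[i], derivatives[j])
                    - np.vdot(derivatives[i], psi) * np.vdot(psi, derivatives[j]))
            metric[i, j] = term.real
    return metric


def regularized_qnpg_step(params, grad, lr=0.1, reg=1e-3):
    """Parameter change lr * (g + reg * I)^{-1} grad (regularization form assumed)."""
    metric = fubini_study_metric(params)
    regularized = metric + reg * np.eye(len(metric))
    return lr * np.linalg.solve(regularized, np.asarray(grad, dtype=float))
```

For this ansatz the metric is $\mathrm{diag}(1/4,\sin^{2}(\theta_{1})/4)$, so it becomes singular for $\theta_{1}\to 0$. This is the kind of situation in which a plain inversion is unstable and a regularization term helps.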

Experimental Results. The effectiveness of the training routine is demonstrated on different instances of ContextualBandits. In a small-scale setting with only a single qubit and two trainable parameters, it is shown that the (regularized) QNPG algorithm converges significantly faster for random initializations compared to the original QPG formulation. For specific initializations, it is moreover validated that the second-order extension does what it was designed for and helps to traverse distorted regions of the loss landscape. An up-scaled experiment with a $12$-qubit VQC underlines the efficiency of the introduced routine.

Training on Quantum Hardware. To demonstrate the practical feasibility of the QPG approach, the authors train a medium-scale instance on actual quantum hardware. The experiment employs a $12$-qubit sub-topology of the $27$-qubit system ibmq_ehningen [IBM23]. The results demonstrate that the quantum agent is actually able to learn meaningful behavior in the $4096$-state ContextualBandits environment. There is some deterioration of the performance compared to noise-free simulation, which is not caused by the training routine itself, as demonstrated by experiments with analytically optimal parameters. However, the learned policy identifies the correct action in a majority of the cases, similar to the hardware results in [Mey+23a].

Remarks. The paper demonstrates the effectiveness of the QNPG routine for ContextualBandits environments; the extension to more generic problems is left for future work. A very interesting consideration is the influence of quantum natural gradients on the barren plateau problem, which is discussed with different results in Refs. [HK21, Tha+23]. The hardware experiment using $12$ qubits is a big improvement upon the results in [Mey+23a] and can be considered the currently largest-scale practical demonstration of VQC-based QRL.

Table 13: Algorithmic Characteristics - Meyer et al. [Mey+23]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates
ContextualBandits (see [SB18])	QNPG	Policy	discrete, $2$	discrete, $2$	$1$	$1$ (encoding), $2$ (weights)
ContextualBandits (see [SB18])¹	QNPG	Policy	discrete, $4096$	discrete, $2$	$12$	$12$ (encoding), $36$ (weights)

¹ hardware experiment: hardware-native circuit structure, details in [Mey+23]

4.2.3 Combined Approximations

It is possible to combine the approach of approximation in value space from Sec. 4.2.1 and in policy space from Sec. 4.2.2. This is formulated in an actor-critic approach in Wu et al. [Wu+23], which is re-implemented and extended in Refs. [Kwa+21, Ree23]. An asynchronous training routine is proposed by S. Y.-C. Chen [Che23]. A soft actor-critic formulation is described by Q. Lan [Lan21]. An extension to multiple agents is proposed in Yun et al. [Yun+22] and extended in Ref. [YPK23].

An overview of progress in the field of quantum multi-agent RL can be found in Ref. [ZY23].

Citation	First Author	Title
[Wu+23]	S. Wu	Quantum reinforcement learning in continuous action space
[Kwa+21]	Y. Kwak	Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation
[Ree23]	V. Reers	Towards Performance Benchmarking for Quantum Reinforcement Learning
[Che23]	S. Y.-C. Chen	Asynchronous training of quantum reinforcement learning
[Lan21]	Q. Lan	Variational Quantum Soft Actor-Critic
[Yun+22]	W. J. Yun	Quantum Multi-Agent Reinforcement Learning via Variational Quantum Circuit Design
[YPK23]	W. J. Yun	Quantum Multi-Agent Meta Reinforcement Learning

Table 14: Work considered for “QRL with VQCs – Combined Approximations” (Sec. 4.2.3)

Quantum reinforcement learning in continuous action space, Wu et al. (2023)

Summary. This paper by Wu et al. [Wu+23] extends the concept of VQC-based RL to continuous action spaces. The authors choose a quantum control environment, more concretely one that encodes an eigenvalue problem. This allows the action to be interpreted as a (parameterized) unitary. The experimental results suggest an exponential reduction in model complexity compared to classical approaches.

Eigenvalue Problem as RL Environment. The RL agent has to solve an eigenvalue problem, i.e., find an eigenvalue of a given Hamiltonian. This is done with the following iterative procedure: Let $H$ be the Hamiltonian of an $n$-qubit quantum system $E$ and $\ket{s_{0}}$ an initial state of $E$. The system should be driven towards an eigenstate of $H$, denoted as $\ket{u_{0}}$, and the corresponding eigenvalue $\lambda_{0}$ should be returned. Although not explicitly stated in the paper, we assume the agent should search for the eigenstate with the smallest associated eigenvalue, as this corresponds to the ground state.

The observation for this environment is the current quantum state $\ket{s_{t}}$ , which is provided to the agent via some quantum channel. The actions the agent can execute correspond to parameterized unitaries $U(\theta_{t})$ , where $\theta_{t}$ are classical parameters sampled from the VQC via measurements. Once instantiated, this unitary is applied to $\ket{s_{t}}$ to evolve the state $U(\theta_{t})\ket{s_{t}}=\ket{s_{t+1}}$ . The agent receives a classical reward, which describes the closeness of the current state to the searched eigenstate of the Hamiltonian.

The authors state that their proposed technique has some parallels to Grover’s search. More concretely, the trained agent provides an alternative to the amplitude amplification procedure, which could otherwise be used to solve the task at hand.

Model Architecture and Underlying RL Algorithm. The overall approach can be considered hybrid, as the optimization of the VQC parameters is still conducted on classical hardware. A schematic description of the approach is given in Fig. 11. The agent observes a quantum state from the environment, which is used as the initial state $\ket{s_{t}}$ of the VQC function approximator. Measurements on the prepared quantum state determine the classical parameters $\theta_{t}$. Those are then fed into the unitary operator $U(\theta_{t})$, which is applied to the environment state. The new state $\ket{s_{t+1}}$, combined with an ancilla reward qubit initialized to $\ket{0}$, is then evolved using some user-defined reward unitary $U_{r}$. Measurements are performed on this state to determine the reward produced by the executed action. This procedure repeats for several timesteps, with the objective of approximating the eigenstate $\ket{u_{0}}$.

The VQC architecture does not incorporate a feature map, as the observation $\ket{s_{t}}$ is used as the initial state $\ket{\Phi}$. Each parameterized layer consists of $1$-qubit rotations and a circular entanglement structure. For every element of the action parameters $\theta_{j}$, there is an associated observable $B_{j}$, which is measured on the prepared quantum state. (The paper does not mention how the action unitary $U(\theta)$ is explicitly constructed.) Following this step, a phase estimation circuit implements the reward unitary $U_{r}=U_{PE}$. This transforms the state to the basis of eigenstates, i.e., $U_{PE}\ket{0}\ket{s_{t+1}}=\sum_{k=0}^{2^{n}-1}\alpha_{t+1,k}\ket{\lambda_{k}}\ket{u_{k}}$. With a measurement of the eigenvalue phase register, the desired eigenvalue $\lambda_{0}$ is observed with a probability of $p_{t+1}=|\alpha_{t+1,0}|^{2}=|\langle s_{t+1}|u_{0}\rangle|^{2}$. The reward can then be defined as e.g. $r_{t+1}=p_{t+1}-p_{t}$. For $p_{t+1}\to 1$, the state $\ket{s_{t+1}}$ converges to $\ket{u_{0}}$ up to a global phase.
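
The reward signal $r_{t+1}=p_{t+1}-p_{t}$ can be mimicked in a classical state-vector simulation, where the overlap is evaluated directly instead of through a phase estimation circuit. The sketch below uses a toy Hamiltonian (Pauli $Z$) and a fixed action unitary (Pauli $X$); both, as well as all function names, are illustrative assumptions and do not stem from [Wu+23].

```python
import numpy as np


def lowest_eigenpair(hamiltonian):
    """Return the smallest eigenvalue and its eigenvector (eigh sorts ascending)."""
    eigenvalues, eigenvectors = np.linalg.eigh(hamiltonian)
    return eigenvalues[0], eigenvectors[:, 0]


def overlap_probability(state, target):
    """p = |<target|state>|^2, evaluated directly on the state vectors."""
    return float(abs(np.vdot(target, state)) ** 2)


def environment_step(state, action_unitary, target, previous_p):
    """Apply the action unitary and return (next state, reward, new overlap)."""
    next_state = action_unitary @ state
    p = overlap_probability(next_state, target)
    return next_state, p - previous_p, p


pauli_z = np.array([[1, 0], [0, -1]], dtype=complex)
pauli_x = np.array([[0, 1], [1, 0]], dtype=complex)

eigenvalue, target = lowest_eigenpair(pauli_z)   # ground state of Z is |1>
s0 = np.array([1, 0], dtype=complex)             # start in |0>, zero overlap
p0 = overlap_probability(s0, target)
s1, reward, p1 = environment_step(s0, pauli_x, target, p0)
```

In this toy case a single bit flip maps $\ket{0}$ onto the ground state, so the overlap jumps from $0$ to $1$ and the full reward is collected in one step.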

The underlying RL routine is an actor-critic method. The paper combines a policy-VQC as actor and a $Q$-function-VQC as critic into a so-called quantum deep deterministic policy gradient (QDDPG) algorithm. The experience of the agent, i.e., tuples $(\ket{s_{t}},\theta_{t},r_{t},\ket{s_{t+1}})$, is stored in a replay buffer to prevent overfitting. Additionally, target networks are employed for both the actor and the critic.
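
The two stabilization components mentioned above, a replay buffer and target networks, can be sketched as follows. The class and function names are hypothetical, and the soft (Polyak) target update is the variant commonly used in DDPG; whether [Wu+23] uses exactly this update is an assumption.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store for transitions (s_t, theta_t, r_t, s_{t+1})."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)  # oldest entries are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlations.
        batch_size = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, tau=0.01):
    """Move the target parameters a small step towards the online parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

In a simulation, the stored states would simply be state vectors. On hardware, storing $\ket{s_{t}}$ is not directly possible, so a description of how to re-prepare the state would have to be stored instead.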

Experimental Results and Model Complexity. All experimental results in the paper are based on classical simulations. For training, the Hamiltonian $H=\frac{1}{4}(s_{x}\sigma_{x}+s_{y}\sigma_{y}+s_{z}\sigma_{z}+I)$ is instantiated with the coefficients $(s_{x},s_{y},s_{z})=(0.13,0.28,0.95)$. Concrete details on the training procedure, e.g., the number of episodes, are not stated. The trained model is applied to $1000$ random initial states. The overlap with the respective $\ket{u_{0}}$ approaches one, i.e., the agent is able to get quite close to the desired eigenstate in all cases. The trained model shows good generalization capabilities, i.e., it can be applied to various initial states. This is in contrast to e.g. a variational quantum eigensolver (VQE), where the control pulse for one initial state is meaningless for other ones.
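
For reference, the single-qubit training Hamiltonian can be written down and diagonalized classically. The sketch below also estimates the overlap that untrained, Haar-random initial states have with the ground state, as a baseline against which overlaps close to one can be judged. The sampling routine and the seed are assumptions; the paper does not specify how the random initial states were drawn.

```python
import numpy as np

IDENTITY = np.eye(2, dtype=complex)
SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Y = np.array([[0, -1j], [1j, 0]])
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def training_hamiltonian(coefficients=(0.13, 0.28, 0.95)):
    """H = (s_x X + s_y Y + s_z Z + I) / 4 with the coefficients from the text."""
    sx, sy, sz = coefficients
    return 0.25 * (sx * SIGMA_X + sy * SIGMA_Y + sz * SIGMA_Z + IDENTITY)


def random_state(rng):
    """Haar-random single-qubit state via a normalized complex Gaussian vector."""
    amplitudes = rng.standard_normal(2) + 1j * rng.standard_normal(2)
    return amplitudes / np.linalg.norm(amplitudes)


hamiltonian = training_hamiltonian()
eigenvalues, eigenvectors = np.linalg.eigh(hamiltonian)
ground_state = eigenvectors[:, 0]

rng = np.random.default_rng(0)
initial_overlaps = [abs(np.vdot(ground_state, random_state(rng))) ** 2
                    for _ in range(1000)]
mean_initial_overlap = float(np.mean(initial_overlaps))
```

The eigenvalues are $(1\pm|s|)/4$ with $|s|=\sqrt{s_{x}^{2}+s_{y}^{2}+s_{z}^{2}}$. For Haar-random single-qubit states the expected overlap with any fixed state is $1/2$, so an untrained agent starts far from the reported values.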

The overall gate complexity for one RL episode is stated as $\mathcal{O}(m\cdot\mathrm{polylog}(N))$. Here, $m$ is the number of shots for sampling expectation values and $N$ denotes the number of qubits. This statement assumes that $H$ can be efficiently simulated, as otherwise the complexity of $U_{PE}$ would exceed $\mathcal{O}(\mathrm{polylog}(N))$. Additionally, all VQCs in the method are assumed to have a gate complexity of at most $\mathcal{O}(\mathrm{polylog}(N))$. With these prerequisites, the authors claim an exponential advantage in model complexity compared to classical approaches.

Generalization to Discrete Action Spaces. The paper also generalizes the presented concept to discrete action spaces, with the FrozenLake environment as an example. The observations are encoded as basis states into the VQC via computational encoding, similar to Chen et al. [Che+20]. The movements applied by the actions are formulated as unitaries acting on the VQC state. A slight generalization of Chen et al. [Che+20] is used for this, which makes it possible to perform the transforms $\ket{0}\to\ket{1}$ and $\ket{1}\to\ket{0}$ in a parameterized manner. The reward unitary is formulated in a similar fashion. It is stated that experiments with this configuration were successful, but no concrete results are provided.

Remarks. There are some caveats and ambiguities we identified regarding the proposed approach. First, the algorithm requires knowledge of, and the ability to prepare, the desired eigenstate $\ket{u_{0}}$ for the training procedure. With this state already known, the whole procedure of reproducing it is a somewhat circular task. However, as the learned model seems to generalize to different input states, the technique offers a clear advantage over approaches like quantum phase estimation. Second, the model requires repeated preparation of the environment state $\ket{s_{t}}$, as it is disturbed by measurements to extract the reward information. This should be doable, as one knows the state preparation routine $\ket{s_{t}}=U(\theta_{t-1})\cdots U(\theta_{0})\ket{s_{0}}$. The influence of this additional overhead is unfortunately not considered in the complexity considerations discussed above. Third, the claim of exponential quantum advantage w.r.t. model complexity (i.e. $\mathcal{O}(\mathrm{polylog}(N))$ for all VQCs) should be supported by larger-scale experiments.

Table 15: Algorithmic Characteristics - Wu et al. [Wu+23]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates
Quantum Eigenvalues Environment (see [Wu+23])	Actor-Critic “QDDPG”	$Q$-function, Policy	quantum	continuous¹	$n$	$0$ (encoding)², $n\times d\times 3$ (weights)³
FrozenLake (OpenAI Gym)	Actor-Critic “QDDPG”	$Q$-function, Policy	discrete⁴, $16$	discrete⁴, $4$	$n$	$0$ (encoding)², $N/A$ (weights)

¹ output is interpreted as parameters of a unitary, i.e. a quantum operation applied to the environment
² the RL state is a quantum state, i.e. no classical information has to be encoded
³ variational gates: $qubits\times layers\times per\_qubit\_per\_layer$; details are not specified
⁴ state and action space are encoded into the quantum realm for a neat integration into the pipeline

Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation, Kwak et al. (2021)

Summary. The paper by Kwak et al. [Kwa+21] gives a short introduction to both RL and (variational) QC. This is followed up by a tutorial on how to implement a VQC-enhanced RL algorithm with PennyLane to solve the CartPole environment.

Hybrid RL Agent. The paper employs the typical hybrid structure, with the VQC as a function approximator. The optimization of the parameters and the interaction with the CartPole environment is executed on classical hardware. The underlying algorithm uses an actor-critic approach, where the actor is quantum and the critic is classical. A set of $1$ -qubit rotations is used to encode the state of the CartPole environment into the four-qubit system. This encoding layer is followed by $4$ layers with learnable $1$ -qubit rotations and an unspecified entanglement structure. The result is extracted from the measurement of $2$ qubits in the computational basis and the respective expectation values are interpreted as the action-value function.

Remarks. The agent is able to surpass random behavior, but lags behind other hybrid approaches [LS20, Jer+21]. To the best of our understanding, the implemented quantum actor-critic approach deviates in some details from previously considered approaches. Most importantly, a hybrid approach is used, where the actor is represented with a VQC and the critic employs a classical DNN. A benchmark analysis of the described setup is proposed and conducted by V. Reers [Ree23].

Table 16: Algorithmic Characteristics - Kwak et al. [Kwa+21]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
CartPole (OpenAI Gym)	Actor-Critic²	$Q$-function	continuous, $4$-dim	discrete, $2$	$4$	$4\times 1$ (encoding), $4\times 4\times 3$ (weights)

¹ encoding gates: $qubits\times per\_qubit$; variational gates: $qubits\times layers\times per\_qubit\_per\_layer$
² only the actor employs a VQC, the critic uses a classical DNN

Asynchronous training of quantum reinforcement learning, S. Y.-C. Chen (2023)

Summary. This work by S. Y.-C. Chen [Che23] introduces an actor-critic approach that is trainable in an asynchronous fashion. This yields the big advantage that training can be spread out over several classical simulators or quantum hardware devices. The efficiency of the introduced quantum asynchronous advantage actor critic (QA3C) algorithm compared to previous formulations is demonstrated on several benchmark environments.

Quantum A3C. The underlying concept is based on the classical A3C algorithm [Mni+16]. This framework makes use of a global shared memory and a process-specific memory for each individual agent. Each agent interacts with the environment independently, and the global model is updated with the information provided by the local agents only once certain criteria are met. This enables a distributed and therefore easily parallelizable training routine. The approximators for both the $Q$-function and the policy are realized using VQCs, with classical neural networks pre- and appended to form a hybrid model.
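
The interplay of global and local parameters can be illustrated with a deterministic, sequential simulation of asynchronous workers. This is a strongly simplified sketch under several assumptions: a toy quadratic objective replaces the actor-critic loss, workers are interleaved round-robin instead of running in parallel, and the function name and all hyperparameters are hypothetical.

```python
import numpy as np


def simulate_async_training(num_workers=4, rounds=100, t_max=5, lr=0.02, seed=0):
    """Workers accumulate gradients on stale local copies and push them globally."""
    rng = np.random.default_rng(seed)
    target = np.array([1.0, -2.0])  # toy optimum standing in for the RL objective
    global_params = np.zeros(2)
    # Every worker starts with its own copy of the global parameters.
    local_params = [global_params.copy() for _ in range(num_workers)]

    for _ in range(rounds):
        for worker in range(num_workers):
            accumulated = np.zeros(2)
            for _step in range(t_max):
                # Noisy sample around the optimum mimics one environment step.
                sample = target + 0.1 * rng.standard_normal(2)
                # Gradient of 0.5 * ||params - sample||^2 at the local copy.
                accumulated += local_params[worker] - sample
            # Push the accumulated gradient to the shared global model ...
            global_params -= lr * accumulated
            # ... and only then synchronize the local copy again.
            local_params[worker] = global_params.copy()
    return global_params, target
```

Each worker computes its gradients on parameters that are already outdated by the updates of the other workers. The global parameters still approach the optimum as long as the effective step size is small compared to this delay.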

Experimental Results. The proposed QA3C algorithm is executed on the environments Acrobot, CartPole, and MiniGrid-SimpleCrossing. It is observed across all instances that the hybrid quantum model is competitive with a much larger classical model. Moreover, it is demonstrated that QA3C outperforms classical A3C employing classical models of comparable complexity.

Remarks. Distributing the training among several workers is certainly an important consideration, given the current access modalities of quantum hardware providers. However, it is not clear whether training can be distributed in practice, considering the long queue waiting times. Moreover, due to the pre- and appended neural networks, it is not clear what role the VQC actually plays. Nevertheless, the comparison to fully classical agents of similar size is an interesting consideration. As usual, it has to be highlighted that the experiments were too small-scale to make meaningful statements on potential quantum advantage.

Table 17: Algorithmic Characteristics - S. Y.-C. Chen [Che23]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
Acrobot (OpenAI Gym)	Actor-critic “QA3C”	$Q$-function, Policy	continuous, $6$-dim³	discrete, $3$³	$8$	$N/A$ (encoding), $48$ (weights)², $148$ (classical)
CartPole (OpenAI Gym)	Actor-critic “QA3C”	$Q$-function, Policy	continuous, $4$-dim³	discrete, $2$³	$8$	$N/A$ (encoding), $48$ (weights)², $107$ (classical)
SimpleCrossing (OpenAI Gym)	Actor-critic “QA3C”	$Q$-function, Policy	continuous, $127$-dim³	discrete, $6$³	$8$	$N/A$ (encoding), $48$ (weights)², $2431$ (classical)

¹ the training process is distributed over $80$ workers, which incorporate a local copy of the parameters
² actor and critic are each composed of an individual hybrid model, i.e. the number of weights is doubled
³ action and state spaces are mapped to the required dimensionality by using classical neural networks

Variational Quantum Soft Actor-Critic, Q. Lan (2021)

Summary. The paper by Q. Lan [Lan21] introduces a quantum version of a soft actor-critic (SAC) approach. The advantage of this algorithm, compared to previous suggestions, is the possibility to work with a continuous action space. The algorithm is tested on the Pendulum environment.

Soft Actor-Critic for Continuous Control. The term continuous control refers to a setup in which the agent acts in a continuous action space. Most publications in the context of QRL deal with discrete action spaces, while a few others discuss continuous control for quantum environments [Wu+23, SSB23]. This work focuses on classical environments, which requires some kind of action decoding based on measurements of the quantum state. Instead of directly selecting the actions based on measurement results, the parameterized hybrid model learns the parameters of a distribution from which the action is sampled. The VQC and a downstream NN are used to represent the mean $\mu$ and variance $\sigma^{2}$ of a Gaussian distribution. This allows the agent to act in a continuous action space in a straightforward manner.
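
A minimal sketch of such a Gaussian policy head is given below. It assumes that the VQC returns one expectation value per qubit, that a single linear layer maps these to the mean and the logarithm of the standard deviation, and that the sampled action is squashed with a tanh to the torque range $[-2,2]$ of the Gym Pendulum environment. The log-parameterization, the clipping range, and the squashing are common SAC conventions and not details taken from [Lan21].

```python
import numpy as np


def gaussian_policy_head(vqc_outputs, weights, bias):
    """Map VQC expectation values to mean and standard deviation."""
    mean, log_std = weights @ vqc_outputs + bias
    # Clipping keeps the standard deviation in a numerically safe range.
    log_std = np.clip(log_std, -5.0, 2.0)
    return float(mean), float(np.exp(log_std))


def sample_action(mean, std, rng, max_action=2.0):
    """Draw a Gaussian sample and squash it into [-max_action, max_action]."""
    raw_action = rng.normal(mean, std)
    return float(max_action * np.tanh(raw_action))


rng = np.random.default_rng(0)
expectation_values = np.array([0.2, -0.5, 0.9])   # placeholder for 3 measured qubits
weights = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])             # hypothetical trained weights
bias = np.zeros(2)

mean, std = gaussian_policy_head(expectation_values, weights, bias)
action = sample_action(mean, std, rng)
```

Parameterizing the logarithm of the standard deviation guarantees a positive $\sigma$ without constraining the output of the network.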

In contrast to the standard RL setup, SAC [Haa+18] not only aims to optimize the expected return, but also the policy entropy [Zie+08, Haa+17]. Therefore, the return is defined as $G_{t}=\sum_{i=t}^{\infty}\gamma^{i-t}(r(s_{i},a_{i})+\alpha\mathcal{H}[\pi_{\theta}(\cdot|s_{i})])$, where $\mathcal{H}[p]=-\int_{\mathbb{R}}p(x)\log p(x)\mathrm{d}x$ is the differential entropy for the probability density function $p(x)$. Among other advantages, this entropy regularization potentially enhances exploration by encouraging more stochastic policies [Haa+17].
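
The entropy-augmented return can be evaluated for a finite trajectory with a simple backward recursion. The following sketch truncates the infinite sum after the last recorded step and uses the closed-form differential entropy of a Gaussian; the function names and default values of $\gamma$ and $\alpha$ are assumptions.

```python
import math


def gaussian_entropy(std):
    """Differential entropy of a one-dimensional Gaussian with standard deviation std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std ** 2)


def soft_return(rewards, entropies, gamma=0.99, alpha=0.2):
    """G_0 = sum_i gamma^i * (r_i + alpha * H_i) for a finite trajectory."""
    g = 0.0
    # Backward recursion: G_t = r_t + alpha * H_t + gamma * G_{t+1}.
    for reward, entropy in zip(reversed(rewards), reversed(entropies)):
        g = reward + alpha * entropy + gamma * g
    return g
```

For example, `soft_return([1.0, 1.0], [1.0, 1.0], gamma=0.5, alpha=0.5)` evaluates to $1.5+0.5\cdot 1.5=2.25$. A broader policy (larger $\sigma$) has a larger entropy and thereby receives a bonus proportional to $\alpha$.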

VQC Architecture. The paper considers two different VQC architectures. The first one uses the typical three-part structure of rotational encoding, variational layers, and measurements. The second architecture is more complex, as it uses data re-uploading [Pér+20] and a more elaborate encoding structure [SJD22, Jer+21]. It can be expected that the second choice gives rise to more expressive models, which usually correlates with RL performance.

Experimental Results. The experimental section of the paper compares the performance of the two resulting quantum SAC approaches to a classical NN on the Pendulum environment. On the one hand, the quantum model with the simple VQC architecture is inferior to the other two approaches. On the other hand, the quantum model with data re-uploading performs similarly to the classical model, and both are able to learn near-optimal behavior. The quantum model incorporates only $41$ parameters, while the classical one uses $1250$. This is interpreted as a quantum advantage w.r.t. parameter complexity.

Some additional architecture experiments are conducted, mainly focusing on the depth of the underlying VQCs. It is observed that a certain number of variational layers is required to enable training. Overall, the performance is strongly correlated with the concrete architecture choice, which is in line with the results known from literature [Fra+22].

Remarks. To substantiate the claim of quantum advantage w.r.t. parameter complexity, more experiments with increasing environment size should be performed. Since NNs are used in combination with VQCs, it is not completely clear which part of the learning is actually conducted by the quantum part. The differing performance of the two architecture choices highlights the importance of designing a sophisticated data encoding scheme.

Table 18: Algorithmic Characteristics - Q. Lan [Lan21]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
Pendulum (OpenAI Gym)	Quantum-SAC	$Q$-function	continuous, $3$-dim	continuous	$3$	$3$ to $12$ (encoding), $36$ (weights)

¹ the hybrid model also incorporates additional classical parameters in an appended NN

Quantum Multi-Agent Reinforcement Learning via Variational Quantum Circuit Design, Yun et al. (2022)

Summary. This paper by Yun et al. [Yun+22] introduces a quantum multi-agent reinforcement learning (QMARL) approach. It is applied to an environment inspired by wireless communication. The authors achieve results that are competitive with classical NNs with higher parameter complexity.

QMARL Framework and VQC Architecture. The approach is inspired by the classical method of centralized training with decentralized execution (CTDE). This approach deals with the problems introduced by a non-stationary reward structure, caused by the interaction of multiple agents [Low+17].

The actor-critic structure employs only a single critic (i.e. represented by a single VQC), which receives the rewards. A naive implementation would increase the qubit count with the number of agents. To resolve this problem, the state encoding routine is modified such that only one qubit is required for each agent.

The general VQC architecture follows the typical three-part structure. The states are encoded using a feature map with $1$-qubit rotations. The state space of the environment is four-dimensional. Consequently, four qubits are used to represent the actor associated with each of the four agents. For the critic, all rotations for the state of one agent are applied to a single qubit. This implies a qubit count equal to the number of agents (i.e. implemented for $4$ qubits in the article). The subsequent learnable layer(s) consist of $1$-qubit rotations and some unspecified entanglement structure. The choice of the measured observables $M$ is not explicitly stated.

Experimental Results and Discussion. The QMARL algorithm is applied to a communication task referred to as Single-Hop Offloading environment. It simulates two clouds, between which packages have to be distributed along four edges. Each cloud and edge has a queue with a certain capacity. One agent is used to learn the actions of its associated edge. The objective is to minimize the overflow and underflow of queues.

The paper compares four different multi-agent reinforcement learning (MARL) and QMARL frameworks: (1) the described version, where actor and critic are each represented with a VQC; (2) a modified pipeline, where the critic is represented with a classical NN; (3) a small-scale classical MARL approach; and (4) a large-scale classical MARL algorithm with over $40000$ trainable parameters. Setups (1) to (3) contain $50$ trainable parameters each.

The results demonstrate that the QMARL approach (1) is competitive with the large-scale MARL algorithm (4). In contrast, the hybrid QMARL method (2) and the small-scale classical MARL approach (3) seem to lack the expressivity to solve this task. The authors conclude that QMARL yields some quantum advantage, as the parameter complexity is drastically reduced.

Remarks. Potentially, compressing all observations of one agent into one qubit is not sufficient to represent the information in a lossless manner. Therefore, larger-scale experiments should be conducted to get more insights into the proposed quantum multi-agent architecture. The same holds for the reduced parameter complexity compared to classical models.
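
The concern about lossy compression can be made concrete with a small state-vector example. A pure single-qubit state has only two real degrees of freedom, so four features cannot be encoded injectively. The sketch below assumes that the four features are applied as $R_{x}$, $R_{y}$, $R_{z}$, $R_{x}$ rotations in sequence; this gate order is a hypothetical choice, as the concrete encoding is not specified here.

```python
import numpy as np


def rx(angle):
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def ry(angle):
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def rz(angle):
    return np.array([[np.exp(-0.5j * angle), 0], [0, np.exp(0.5j * angle)]])


def encode_on_single_qubit(features):
    """Apply Rx, Ry, Rz, Rx to |0>, one rotation per feature (assumed gate order)."""
    state = np.array([1, 0], dtype=complex)
    for gate, value in zip((rx, ry, rz, rx), features):
        state = gate(value) @ state
    return state


# Two different observations that end up in exactly the same quantum state.
state_a = encode_on_single_qubit([0.3, 0.0, 0.0, 0.5])
state_b = encode_on_single_qubit([0.5, 0.0, 0.0, 0.3])
```

Both observations are mapped to $R_{x}(0.8)\ket{0}$, so the critic cannot distinguish them. Whether such collisions matter in practice depends on the distribution of observations in the environment.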

Table 19: Algorithmic Characteristics - Yun et al. [Yun+22]

Environment	Algorithm Type	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates
Single-Hop Offloading (see [Yun+22])	Multi-Agent Actor-Critic “QMARL”	$Q$-function, Policy	continuous, $4$-dim	discrete, $4$	$4$	$4$ or $16$ (encoding)¹, $N/A$ (weights)

¹ the $4$ quantum actors use $4$ encoding parameters each; the quantum centralized critic contains $16$

Quantum Multi-Agent Meta Reinforcement Learning, Yun et al. (2023)

Summary. The second paper by Yun et al. [YPK23] extends their previous approach [Yun+22] with various new techniques for QMARL. It proposes to use meta-learning by pre-training only one individual agent. This is followed by fine-tuning in the multi-agent scenario. To this end, two different types of trainable parameters are used, i.e. trainable measurements are introduced to complement the typical variational parameters. The approach is also extended to continual learning, where meta-learning is performed on multiple environments at once.

VQC Architecture and meta-QMARL. The underlying RL algorithm employs an SAC approach with the VQC as function approximator for the action-value function. The quantum circuit uses the three-layer structure of $1$ -qubit rotation data encoding, variational layers with entanglement gates, and measurement. The paper applies QRL to multi-agent problems and extends the original proposal on quantum CTDE by Yun et al. [Yun+22]. An additional step is introduced for the training procedure, resulting in a meta-learning approach.

In order to realize these concepts, the authors define two different sets of parameters. First, there are the typical variational parameters $\boldsymbol{\phi}$, usually parameterizing $1$-qubit rotations. Second, it is also possible to parameterize and train the measurement observables. The paper proposes to use $M_{\theta_{1,2}^{(m)}}=R_{x}^{\dagger}(\theta_{1})\cdot R_{y}^{\dagger}(\theta_{2})\cdot Z\cdot R_{y}(\theta_{2})\cdot R_{x}(\theta_{1})$ as observable on the $m$-th qubit, i.e. two trainable parameters for each $1$-qubit observable. Basically, this trainable observable introduces a change of basis, as final measurements are always performed in the computational basis. The instantiated observable can be visualized on the Bloch sphere as the axis along which the measurement is performed.
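
The trainable observable can be constructed explicitly as a $2\times 2$ matrix. The sketch below uses the standard conventions $R_{x}(\theta)=e^{-i\theta X/2}$ and $R_{y}(\theta)=e^{-i\theta Y/2}$; that the paper uses exactly these conventions is an assumption.

```python
import numpy as np

PAULI_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def trainable_observable(theta_1, theta_2):
    """M = Rx(theta_1)^dagger Ry(theta_2)^dagger Z Ry(theta_2) Rx(theta_1)."""
    basis_change = ry(theta_2) @ rx(theta_1)
    return basis_change.conj().T @ PAULI_Z @ basis_change


observable = trainable_observable(0.4, 1.1)
zero_state = np.array([1, 0], dtype=complex)
expectation = float(np.real(np.vdot(zero_state, observable @ zero_state)))
```

As a unitary conjugation of $Z$, the observable is Hermitian with eigenvalues $\pm 1$ for all parameter values. Its expectation value in $\ket{0}$ is $\cos(\theta_{1})\cos(\theta_{2})$, which shows how the two pole parameters tilt the measurement axis away from the $z$-direction.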

Both parameter sets are trained in alternating steps. The first step is referred to as meta quantum neural network (QNN) angle training and focuses exclusively on the variational parameters $\boldsymbol{\phi}$. This step trains only a single quantum agent, which interacts with several other agents in the multi-agent environment. Unfortunately, the authors do not state how this interaction is actually realized. We assume that the quantum agent interacts with other classical agents in this initial training phase. During this phase, the pole parameters $\boldsymbol{\theta}$ are not updated, but they can be perturbed with randomly selected values to form a kind of angle-to-pole regularization. The second phase, the local QNN pole training, focuses on the parameterized observables. Those are fine-tuned individually for each copy of the meta-trained QNN, corresponding to the all-quantum agents interacting in the multi-agent environment. The authors argue that meta-training the network makes fine-tuning the individual agents more efficient. This is justified with the lower parameter complexity, as the variational parameters remain constant in the second training phase. The loss function is the sum of all $Q$-learning losses of the individual agents.
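
The separation into the two training phases can be sketched on a deliberately small toy problem. The example assumes a single qubit with one variational angle, a squared regression loss towards a fixed target instead of the multi-agent $Q$-learning loss, Gaussian perturbations as angle-to-pole regularization, and finite-difference gradients. None of these choices is taken from [YPK23].

```python
import numpy as np

PAULI_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def model_output(angle, poles):
    """Expectation of the trainable observable in the state RY(angle)|0>."""
    psi = ry(angle) @ np.array([1, 0], dtype=complex)
    rotated = ry(poles[1]) @ rx(poles[0]) @ psi  # basis change, then measure Z
    return float(np.real(np.vdot(rotated, PAULI_Z @ rotated)))


def loss(angle, poles, target):
    return (model_output(angle, poles) - target) ** 2


def numerical_grad(func, x, eps=1e-5):
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        shift = np.zeros_like(x)
        shift[i] = eps
        grad[i] = (func(x + shift) - func(x - shift)) / (2 * eps)
    return grad


def meta_angle_training(angle, poles, target, steps=200, lr=0.1, noise=0.1, seed=0):
    """Phase 1: update only the angle; the poles are perturbed but never trained."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        noisy_poles = poles + noise * rng.standard_normal(2)
        grad = numerical_grad(lambda a: loss(a[0], noisy_poles, target), [angle])
        angle = angle - lr * grad[0]
    return float(angle)


def local_pole_training(angle, poles, target, steps=200, lr=0.1):
    """Phase 2: keep the meta-trained angle fixed and update only the poles."""
    for _ in range(steps):
        poles = poles - lr * numerical_grad(lambda p: loss(angle, p, target), poles)
    return poles
```

A typical use would first call `meta_angle_training` once and then run `local_pole_training` separately for every agent with its own target, reusing the same meta-trained angle.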

Additionally, the paper introduces the concept of pole memory, which refers to storing the trained pole parameters for the individual agents. As these sets are much smaller than the set of variational parameters, it is more efficient to store the full configuration.

Experimental Results. The introduced training routine is executed on a two-step two-agent environment. It is observed that meta-training converges more slowly than direct training of a QMARL agent. However, once this training has converged, fine-tuning is much more efficient. Overall, the authors conclude that the additional step of meta-training enhances convergence in a multi-agent environment.

Extension to Continual Learning. The above setting is also extended to continual learning, i.e. training in more than one environment (or typically the same environment with slightly altered dynamics).

The investigation focuses on the difference in performance with and without the use of pole memory. The results suggest that resetting the poles to the initial state (i.e. the parameter setting with which meta training was conducted) benefits convergence speed and stability in an environment with alternating dynamics. Meta training with a higher degree of angle-to-pole regularization seems to enhance the generalization performance of the meta-QNN.

Remarks. The paper does not state explicitly how exactly the initial meta training is conducted. Considering the results, the VQCs seem to have some capability for transfer learning, which is how the meta-learning and continual training can be interpreted. The idea of employing trainable observables also has potential for other approaches, as it partially avoids the necessity to explicitly pre-select an action decoding scheme. Practically, these trainable observables are introduced by adding an additional layer to the VQC which learns a specific measurement. A significant difference to pre-existing procedures is that these parameters are not trained simultaneously with the typical variational parameters. It is not completely clear whether this two-step training procedure is beneficial in a general setup.

Table 20: Algorithmic Characteristics - Yun et al. [YPK23]

Environment Type	Algorithm	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
Single-Hop Offloading (see [Yun+22])	Meta-Multi-Agent SAC “Meta-QMARL”	$Q$-function	continuous $4$-dim	discrete $4$	$4$	$4$ (encoding), $N/A$ (weights)
Two-Step Game (see [YPK23])	Meta-Multi-Agent SAC “Meta-QMARL”	$Q$-function	continuous $4$-dim	discrete $2$	$2$	$2$ (encoding), $N/A$ (weights)

¹ The parameter counts are denoted for a single agent.

4.2.4 Offline Methods

Offline reinforcement learning [Lev+20] deals with the setting in which no direct interaction with the environment is possible. Instead, the agent is trained on a set of pre-acquired data. Two alternative formulations for the quantum realm have been proposed in Periyasamy et al. [Per+23] and Cheng et al. [Che+23].

Citation	First Author	Title
[Per+23]	M. Periyasamy	Batch Quantum Reinforcement Learning
[Che+23]	Z. Cheng	Offline Quantum Reinforcement Learning in a Conservative Manner

Table 21: Work considered for “QRL with VQCs– Offline Methods” (Sec. 4.2.4)

Batch Quantum Reinforcement Learning, Periyasamy et al. (2023)

Summary. In this work, Periyasamy et al. [Per+23] propose batch-constrained quantum $Q$-learning (BCQQ), an offline QRL algorithm based on the classical discrete batch-constrained deep $Q$-learning (BCQ) algorithm by Fujimoto et al. [Fuj+19]. Furthermore, the authors introduce a novel data re-uploading (DRU) scheme, which they call cyclic DRU. Experiments are executed in the OpenAI CartPole environment.

Algorithm. The key idea in BCQ is that, in order to avoid a distributional shift from training to testing, a trained policy should induce at test time a state-action visitation similar to that observed in the offline training data, the so-called batch. Hence the name batch-constrained.

To achieve this, BCQ trains a generative model $G_{\omega}$ to pre-select likely actions based on the batch. Through this selection, the policy is constrained to only choose from a subset of actions. In the case of a discrete action space, the generative model can be understood as a map $G_{\omega}:\mathcal{S}\rightarrow\Delta\left(\mathcal{A}\right)$ that takes the current environment state as input and outputs the probability with which each action would occur in the batch. In particular, if the batch is filled using transitions from a policy $\pi_{b}$, then the generative model should imitate this policy, i.e. $G_{\omega}(a|s)\approx\pi_{b}(a|s)$. Therefore, $G_{\omega}$ is called the imitator.

Using this imitator, actions can be pre-selected by discarding actions whose probability relative to the most likely one is below a threshold $\tau$:

\tilde{\mathcal{A}}(s)=\left\{a\in\mathcal{A}\,\Bigg|\,\frac{G_{\omega}(a|s)}{\max_{\hat{a}\in\mathcal{A}}G_{\omega}(\hat{a}|s)}>\tau\right\}.

(46)

The actions selected by the imitator are then evaluated by a $Q$ -network, which is trained by only considering the selected actions in the loss computation. The imitator itself is trained with a standard cross-entropy loss

l(\omega)=-\sum_{(s,a)\in\mathcal{B}}\log\left(G_{\omega}(a|s)\right).

Additionally, to address the overestimation bias of $Q$ -learning towards state transitions that are underrepresented in the batch, double DQN [VGS16] is employed.
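The action pre-selection of Eq. 46 and the imitator loss can be illustrated with a short NumPy sketch. The function names are hypothetical, and the imitator probabilities and $Q$-values are passed in as plain arrays rather than produced by a VQC. Choosing greedily among the allowed actions is an assumption in the spirit of BCQ, not a detail taken from the paper.

```python
import numpy as np


def allowed_actions(imitator_probs, tau):
    """Boolean mask of the actions kept by Eq. 46 for a single state."""
    probs = np.asarray(imitator_probs, dtype=float)
    # Keep actions whose probability relative to the most likely one exceeds tau.
    return probs / probs.max() > tau


def constrained_greedy_action(q_values, imitator_probs, tau):
    """Greedy action w.r.t. the Q-values, restricted to the allowed subset."""
    mask = allowed_actions(imitator_probs, tau)
    masked_q = np.where(mask, np.asarray(q_values, dtype=float), -np.inf)
    return int(np.argmax(masked_q))


def imitator_loss(batch_probs, batch_actions):
    """Cross-entropy loss l(omega) = -sum log G(a|s) over the batch.

    batch_probs has shape (batch, n_actions), batch_actions has shape (batch,).
    """
    probs = np.asarray(batch_probs, dtype=float)
    actions = np.asarray(batch_actions, dtype=int)
    chosen = probs[np.arange(len(actions)), actions]
    return float(-np.sum(np.log(chosen)))
```

For any threshold $\tau<1$ the most likely action always survives the filter, so the constrained action set is never empty.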

Finally, the BCQQ algorithm is obtained by applying the variational quantum deep $Q$-networks (VQ-DQN) proposed by Franz et al. [Fra+22] as function approximators for both the imitator and the $Q$-network. Moreover, for model training the authors use the AMSGrad optimizer in combination with gradients approximated via SPSA. Wiedmann et al. [Wie+23] demonstrated that SPSA can be used to efficiently train medium-sized VQCs with a reduced number of circuit runs compared to the commonly used parameter-shift rule.

Model Architecture. The VQC used as the function approximator for the imitator and $Q$-network is shown in Fig. 12. Each entry of the four-dimensional state vector returned by the CartPole environment is encoded using a single-qubit $R_{x}$ gate on an individual qubit. The variational block comprises five layers, each containing four parameterized $R_{y}$ and four parameterized $R_{z}$ gates. In addition to the parameterized rotational gates, each layer also includes two-qubit CZ entanglement gates with nearest-neighbor connectivity. The CartPole environment has two discrete actions. Therefore, the expectation values of the Pauli-$ZZ$ observable on qubits 1 and 2 and of the Pauli-$ZZ$ observable on qubits 3 and 4 are used to decode the $Q$-values from the VQC. Furthermore, trainable classical weights are applied to both expectation values to increase the range of possible $Q$-values.

Periyasamy et al. [Per+22] established that spreading encoding gates for the feature vector of a given data point throughout the quantum circuit results in an improved representation of the data when the expectation values are measured for observables containing all Pauli strings. Following this, the authors use a re-uploading scheme which exposes each qubit to all the entries of the current input state vector. In contrast to standard data re-uploading, where the same encoding is re-introduced unchanged after each variational layer, the encoding is re-introduced with the input state vector shifted by one step in a round-robin fashion. The structure of a VQC with this cyclic DRU is shown on the right of Fig. 12.
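The round-robin shift behind cyclic DRU can be illustrated by computing which feature is encoded on which qubit in each layer. This is only a sketch of the index bookkeeping; the function name is hypothetical, and the shift direction is an assumption, as the exact layout is only given in the paper's figure.

```python
import numpy as np


def cyclic_encoding_angles(state, num_layers):
    """Return an array of shape (num_layers, n_qubits).

    Entry [layer, qubit] is the feature fed to the encoding rotation on that
    qubit in that layer. Layer 0 uses the state as is; every further layer
    shifts the state vector by one more position (round-robin).
    """
    state = np.asarray(state, dtype=float)
    return np.stack([np.roll(state, shift) for shift in range(num_layers)])
```

With as many layers as features, every column of the returned array (i.e. every qubit) contains each entry of the state vector exactly once, whereas standard data re-uploading would repeat the same entry on a given qubit in every layer.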

Experimental Results and Discussion. In order to evaluate the performance of BCQQ, the authors train policies on buffers with varying sizes, filled with randomly sampled environment interactions. As a classical benchmark the authors train neural networks instead of VQCs on the same buffers. For this benchmark, they first use a fully connected neural network with a total number of 67270 parameters and second a smaller network with just 55 parameters. The number of parameters in the smaller network is much more comparable to the VQC. The authors find that the BCQQ agent is able to learn an optimal policy, achieving the maximum reward of 500, from a buffer of just 100 random environment interactions. Interestingly, the classical agents fail to learn a policy in this low data regime, suggesting a potential quantum advantage in terms of the sample efficiency.

Moreover, the cumulative reward these models can achieve beyond 500 is tested, which shows that the VQC with cyclic DRU outperforms the VQC with standard DRU. All these experiments were performed using an early stopping criterion, where during training the current policy is evaluated in the actual environment to save computational resources. Strictly speaking, this makes the training not fully offline. In a second experiment, however, the authors train the VQC with cyclic DRU on a buffer filled with 100 interactions obtained from an optimal policy with noise. From this, the authors show that without early stopping the BCQQ agent can learn an optimal policy from this noisy buffer.

Remarks. It remains to be shown that the observed sample efficiency scales to more complex environments. Furthermore, a more elaborate analysis of the effectiveness of cyclic DRU could give insights for future VQC design.

Table 22: Algorithmic Characteristics - Periyasamy et al. [Per+23]

Environment Type	Algorithm	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
CartPole (OpenAI Gym)	BCQ	Imitator, $Q$-function	continuous $4$-dim	discrete $2$	$4$	$4\times 1$ (encoding), $4\times 15\times 2$ (weights), $N/A$ (classical)²

¹ Encoding gates: $qubits\times per\_qubit$; variational gates: $qubits\times layers\times per\_qubit\_per\_layer$.
² The model incorporates classical weights after measurement; details are not stated.

Offline Quantum Reinforcement Learning in a Conservative Manner, Cheng et al. (2023)

Summary. This work by Cheng et al. [Che+23] introduces the offline QRL algorithm, conservative quantum $Q$ -learning (CQ2L). In contrast to online RL, offline RL is used in scenarios where the agent cannot interact with the environment during training and is hence trained purely data-driven from a set of previously collected data. The proposed algorithm is based on the classical conservative $Q$ -learning (CQL) algorithm by Kumar et al. [Kum+20]. Experiments are conducted in the OpenAI CartPole, Acrobot and MountainCar environments.

Algorithm. The objective of offline RL is to learn a near-optimal policy from a fixed dataset $\mathcal{D}$ sampled with a behavior policy $\pi_{b}$ , without further environment interactions. A major challenge in this setting is that the fundamental assumption that agents can sample data online is violated. This means that agents have to learn a policy or value function from out-of-distribution (OOD) data, which is nontrivial. This distributional shift makes it hard to evaluate and consequently improve current Q-value functions, leading to an extrapolation error [Kos+21].

Under the online setting, agents obtain corrective feedback through environment interactions. However, for offline training, the extrapolation error means that agents could overestimate $Q$ -values for unseen state-action pairs, which could lead to poor performance. Hence, CQL suppresses the overestimation problem in offline RL by learning a conservative $Q$ -value function. In particular, this is achieved via double $Q$ -learning [VGS16] and a penalty term to update the Q-values in a conservative manner. The resulting conservative update target is obtained as

\underset{Q}{\text{argmin}}\;\alpha\cdot\underset{s\sim\mathcal{D}}{\mathbb{E}}\left(\log\sum_{a\in A}\exp(Q(s,a;\theta_{k}^{A}))-\underset{a\sim\pi_{b}}{\mathbb{E}}Q(s,a;\theta_{k}^{A})\right)+\underset{(s,a,r,s^{\prime})\sim\mathcal{D}}{\mathbb{E}}\left(Y_{k}^{\text{DoubleQ}}-Q(s,a)\right)^{2},

(47)

with the double $Q$ -learning target update

Y_{k}^{\text{DoubleQ}}:=r+\gamma\cdot Q(s^{\prime},\underset{\overline{a}\in A}{\text{argmax}}\;Q(s^{\prime},\overline{a};\theta_{k}^{A});\theta_{k}^{B}).

(48)

Here, $\theta_{k}^{A}$ and $\theta_{k}^{B}$ denote two independent sets of parameters, which are updated similarly to the target network in the deep $Q$-network (DQN) algorithm, by symmetrically exchanging the roles of $\theta_{k}^{A}$ and $\theta_{k}^{B}$ in Eq. 48. Having these independent parameters helps to compute unbiased $Q$-value estimates. CQ2L is then obtained by implementing the $Q$-value function via the VQ-DQN proposed by Franz et al. [Fra+22].
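A batch estimate of the objective in Eq. 47 with the target of Eq. 48 can be sketched in NumPy as follows. The function names are hypothetical and the $Q$-values are supplied as arrays instead of being evaluated by a VQC. As an assumption of this sketch, the expectation over $a\sim\pi_{b}$ is approximated by the actions stored in the dataset.

```python
import numpy as np


def double_q_target(reward, gamma, q_next_a, q_next_b):
    """Eq. 48: select the next action with parameters A, evaluate it with B."""
    best_action = int(np.argmax(q_next_a))
    return reward + gamma * q_next_b[best_action]


def conservative_loss(q_values, actions, targets, alpha):
    """Sample estimate of Eq. 47 for one batch.

    q_values: (batch, n_actions) Q-values of the sampled states,
    actions:  (batch,) actions stored in the dataset,
    targets:  (batch,) double Q-learning targets from Eq. 48.
    """
    q_values = np.asarray(q_values, dtype=float)
    actions = np.asarray(actions, dtype=int)
    targets = np.asarray(targets, dtype=float)
    q_taken = q_values[np.arange(len(actions)), actions]
    # Numerically stable log-sum-exp over the action dimension.
    q_max = q_values.max(axis=1, keepdims=True)
    log_sum_exp = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))
    penalty = np.mean(log_sum_exp - q_taken)      # conservative regularizer
    td_error = np.mean((targets - q_taken) ** 2)  # squared Bellman error
    return float(alpha * penalty + td_error)
```

The penalty term pushes down the $Q$-values of all actions through the log-sum-exp while pushing up those of actions that actually appear in the data, which is what counteracts overestimation for unseen state-action pairs.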

Model Architecture. VQCs with 5 layers are used to represent the $Q$-value functions. For CartPole, Acrobot and MountainCar, systems with 4, 6 and 2 qubits are used, respectively. According to the feasible actions in these environments, the quantum observables $[Z_{0}Z_{1},Z_{2}Z_{3}]$, $[Z_{0},Z_{1},Z_{2}]$, and $[Z_{0},Z_{0}Z_{1},Z_{1}]$ are chosen, where $Z_{i}$ denotes the Pauli-$Z$ observable on the $i$-th qubit. Input data are encoded with X rotation gates, while the variational part includes X, Y, and Z rotation gates. Moreover, qubits are entangled in a circular topology. The variational part, entanglement, and data encoding are repeated several times, after which the Pauli-$Z$ observables are measured to determine the $Q$-values.

Experimental Results and Discussion. To evaluate the offline QRL algorithm, the authors create offline data sampled by a DQN agent with epsilon-greedy policy, interacting with the corresponding environment. The sampled data are recorded in a replay buffer with length $1\times 10^{6}$ and then saved for offline QRL. The logged data contain tuples of $(s_{t},a_{t},r_{t},s_{t+1},d)$ , where $d$ indicates whether an episode terminates. For training, a single trajectory from the collected buffer is selected.

The authors compare the performance of CQ2L with the off-policy VQ-DQN trained offline on the same data. These experiments show that CQ2L is able to solve all given environments and outperform offline VQ-DQN. The latter indicates that it is not feasible to directly extend off-policy QRL algorithms like VQ-DQN to the offline setting. Furthermore, the authors find that CQ2L performs only marginally worse than online VQ-DQN in CartPole. Interestingly, online VQ-DQN fails to solve Acrobot and MountainCar and is clearly outperformed by CQ2L.

Finally, the performance is compared to classical CQL, where a fully connected neural network with a similar number of parameters as the VQC is used. The results indicate that CQ2L could achieve comparable performance to the classical one. Besides, no significant advantages in the sample efficiency or the parameter size are observed. The authors hypothesize that this may indicate that the current structure of VQCs or the limited number of qubits is not sufficient to exhibit quantum advantages for QRL.

Remarks. The absence of a significant advantage in sample efficiency or parameter count, as reported by the authors, contradicts other observations in the literature, where at least for small system sizes some improvement w.r.t. parameter complexity was observed. However, we agree with the statement that such performance improvements might strongly depend on the specific VQC architecture.

Table 23: Algorithmic Characteristics - Cheng et al. [Che+23]

Environment Type	Algorithm	Quantum Component	State Space	Action Space	Qubits	Parameterized Gates¹
CartPole (OpenAI Gym)	CQL	$Q$-function	continuous $4$-dim	discrete $2$	$4$	$4\times 1$ (encoding), $4\times 15\times 2$ (weights), $N/A$ (classical)²
Acrobot (OpenAI Gym)	CQL	$Q$-function	continuous $6$-dim	discrete $3$	$6$	$4\times 1$ (encoding), $4\times 15\times 2$ (weights), $N/A$ (classical)²
MountainCar (OpenAI Gym)	CQL	$Q$-function	continuous $2$-dim	discrete $3$	$2$	$4\times 1$ (encoding), $4\times 15\times 2$ (weights), $N/A$ (classical)²

¹ Encoding gates: $qubits\times per\_qubit$; variational gates: $qubits\times layers\times per\_qubit\_per\_layer$.
² The model incorporates classical weights after measurement; details are not stated.

4.2.5 Algorithmic and Conceptual Extensions

This section describes extensions to the VQC-based QRL framework that are relevant to several of the previously classified methods. This entails tools to deal with partially observable (quantum) environments discussed in Kimura et al. [Kim+21]. A strong emphasis is put on the explicit design of model architectures. Work by Hsiao et al. [Hsi+22, Tru+23] demonstrates that this is indeed an important topic, as otherwise everything could be easily emulated with classical architectures. Different approaches to this design task are discussed in Refs. [Che23c, Che23a, Dră+22, Kru+23, SMT23, ACN23, PPR20]. Avoiding the typical gradient-based training routines, an evolutionary approach is proposed by Chen et al. [Che+22] and also discussed in Refs. [DS23, Köl+23].

Citation	First Author	Title
[Kim+21]	T. Kimura	Variational Quantum Circuit-Based Reinforcement Learning for POMDP and Experimental Implementation
[Hsi+22]	J.-Y. Hsiao	Unentangled quantum reinforcement learning agents in the OpenAI Gym
[Tru+23]	N. Truong	Investigating Quantum Reinforcement Learning structure to the CartPole control task
[Che23c]	S. Y.-C. Chen	Quantum deep recurrent reinforcement learning
[Che23a]	S. Y.-C. Chen	Efficient quantum recurrent reinforcement learning via quantum reservoir computing
[Dră+22]	T.-A. Drăgan	Quantum Reinforcement Learning for Solving a Stochastic Frozen Lake Environment and the Impact of Quantum Architecture Choices
[Kru+23]	G. Kruse	Variational Quantum Circuit Design for Quantum Reinforcement Learning on Continuous Environments
[SMT23]	Y. Sun	Differentiable Quantum Architecture Search for Quantum Reinforcement Learning
[ACN23]	E. Andrés	Efficient Dimensionality Reduction Strategies for Quantum Reinforcement Learning
[Che+22]	S. Y.-C. Chen	Variational quantum reinforcement learning via evolutionary optimization
[DS23]	L. Ding	Multi-objective evolutionary search for parameterized quantum circuits
[Köl+23]	M. Kölle	Multi-Agent Quantum Reinforcement Learning using Evolutionary Optimization

Table 24: Work considered for “QRL with VQCs– Algorithmic and Conceptual Extensions” (Sec. 4.2.5)

Variational Quantum Circuit-Based Reinforcement Learning for POMDP and Experimental Implementation, Kimura et al. (2021)

Summary. The paper by Kimura et al. [Kim+21] extends the concept of VQC-based RL to partially observable environments. The approach is inspired by classical model-free, complex-valued RL [HSS06]. Additionally, a novel VQC architecture (novel with regard to measurement procedure) is proposed. A detailed description of the gradient computation with backpropagation techniques is provided (it is not quite clear how this method generalizes to quantum hardware).

Partially Observable MDP. A partially observable Markov decision process (POMDP) is described as a tuple $(S,A,T,R,\Omega,O)$ and is a generalization of an MDP. The variable $S$ denotes a discrete state space, $A$ is a discrete set of actions, $T(s^{\prime}|s,a)$ describes the state transition probabilities and $R(s,a)$ is a reward function. Extending the fully-observable case, $\Omega$ is a discrete set of observations and $O(o|s,a)$ is an observation probability matrix with $o\in\Omega$.

One caveat of partially observable environments is the perceptual aliasing problem. This refers to the fact that the agent cannot distinguish two different states due to its limited observation ability. An example of such an environment is the partially observable maze used in Kimura et al. [Kim+21]. Similar to most gridworld environments, the task is to navigate from the start state to the goal state on the shortest path possible. However, the observations provided to the agent are ambiguous, as several cells return the same state indicator.

Solving POMDPs with Complex Valued RL. One way to bypass this state ambiguity is to introduce a belief distribution over possible states. Unfortunately, this is computationally expensive. An alternative approach is complex-valued RL [HSS06, MNM17]. It incorporates time series information into the action-value function, which is represented by complex numbers. More concretely, the complex $\dot{Q}$-function ($\dot{x}$ denotes complex values) encodes the history of the agent, i.e. the previously visited states. The cumulative reward value is expressed by the absolute value of the $\dot{Q}$-function, while the path length of the propagated reward is represented by the phase of the $\dot{Q}$-function in the complex plane. Therefore, $\dot{Q}$-learning keeps continuity w.r.t. the described internal reference value. This helps to distinguish states which are affected by the perceptual aliasing problem. Formally, this is achieved by updating the complex values in the opposite phase direction. The complex-valued $\dot{Q}$-function can be represented with tabular methods [HSS06], or with complex-valued NNs [MNM17].

The update mechanism represents a generalized $Q$-learning approach, i.e. the objective is to optimize the loss function $L_{\theta}=\frac{1}{2}\cdot|\dot{Q}(o_{t-k},a_{t-k})-(r_{t+1}+\gamma\cdot\dot{Q}_{max}(t))\dot{u_{t}}(k)|$. Here, $\dot{u_{t}}(k)=\dot{\beta}^{k+1}$ is a complex hyperparameter and $k$ is the trace length. In the following, the NN is replaced with a VQC as action-value function approximator.
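For a single trace step, the loss above can be evaluated with ordinary complex arithmetic. The following sketch only mirrors the stated formula; the function name is hypothetical, and the value $\dot{Q}_{max}(t)$ is passed in as an argument because its selection rule is not spelled out in this summary.

```python
def complex_q_loss(q_past, reward, gamma, q_max, beta, k):
    """L = 1/2 * |Q(o_{t-k}, a_{t-k}) - (r_{t+1} + gamma * Q_max(t)) * beta^(k+1)|.

    q_past, q_max and beta are complex numbers; k is the trace index.
    """
    u = beta ** (k + 1)                       # complex trace factor u_t(k)
    target = (reward + gamma * q_max) * u     # phase-rotated bootstrap target
    return 0.5 * abs(q_past - target)
```

If $|\dot{\beta}|=1$, the factor $\dot{\beta}^{k+1}$ leaves the magnitude of the target unchanged and only rotates its phase, so entries that lie further back in the trace are compared against a target with a correspondingly larger phase offset.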

VQC Architecture and Gradient Computation. The paper deviates in several design choices from the standard method. Most importantly, the $\dot{Q}$ -values for the different actions are not extracted from the same circuit (e.g. measurement on different qubits corresponding to different actions). Instead, the actions are encoded into the VQC with a feature map similar to the one used for state encoding. Consequently, different circuits have to be evaluated for each action. This encoding can happen either directly in the feature map, or alternatively into the decoding unitary. However, the three-part structure of the circuit is preserved.

The encoding unitary $U_{encoder}$ consists of simple parameterized $1$ -qubit rotations, where three different concrete encodings are considered as shown in Fig. 13. The Type 1 feature map encodes the observations directly with an $\arcsin$ function. Type 2 uses a computational encoding for the observations, basically equivalent to the one proposed by Chen et al. [Che+20]. Type 3 also uses an $\arcsin$ transform, but directly encodes the action information into the feature map.

The variational part repeats several layers of parameterized $1$ -qubit rotations, followed by a circular entanglement structure. In the experimental part, the authors consider different circuit depths.

The output of the circuit is evaluated with the Hadamard test, which measures the prepared state against an output unitary $U_{out}$. This introduces an overhead, since a controlled version of the unitary $U_{out}$ needs to be implemented (details in Fig. 13). Additionally, an ancilla qubit and three $1$-qubit gates are required. The output unitary itself also consists of learnable $1$-qubit rotations. If encoding unitaries of Type 1 or Type 2 are used, it additionally encodes the action information. The real and imaginary outputs of the Hadamard test are used to construct the complex-valued $\dot{Q}$-function.
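How a Hadamard test yields a complex number can be checked with a small statevector simulation. The sketch below implements the textbook version of the test (Hadamard, optional $S^{\dagger}$, controlled unitary, Hadamard on the ancilla) for an arbitrary state and unitary; it is not the circuit of Fig. 13, and the function name is hypothetical.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
S_DAG = np.diag([1, -1j])


def hadamard_test(psi, u, imaginary=False):
    """Return Re<psi|U|psi> (or Im<psi|U|psi>) as P(ancilla=0) - P(ancilla=1)."""
    dim = psi.shape[0]
    eye = np.eye(dim, dtype=complex)
    zeros = np.zeros((dim, dim), dtype=complex)
    # The ancilla is the first tensor factor and starts in |0>.
    state = np.kron(np.array([1, 0], dtype=complex), psi)
    state = np.kron(H, eye) @ state
    if imaginary:
        # An extra S^dagger on the ancilla switches to the imaginary part.
        state = np.kron(S_DAG, eye) @ state
    controlled_u = np.block([[eye, zeros], [zeros, u]])
    state = controlled_u @ state
    state = np.kron(H, eye) @ state
    p0 = np.sum(np.abs(state[:dim]) ** 2)
    p1 = np.sum(np.abs(state[dim:]) ** 2)
    return float(p0 - p1)
```

Running the test once with `imaginary=False` and once with `imaginary=True` gives the two real numbers from which a complex value such as $\dot{Q}$ can be assembled. On hardware, both probabilities would have to be estimated from repeated measurements of the ancilla.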

The evaluation of the circuits is straightforward on quantum hardware. However, this does not apply to evaluating gradients w.r.t. the parameters, which is necessary for training. The paper gives a detailed derivation of how to compute the gradients via simulation on classical hardware. The idea is inspired by classical backpropagation and resembles the adjoint method [Luo+20]. This makes it infeasible, at least in the given form, for actual quantum hardware.

Experimental Results and Discussion. The paper compares the training results (on the described maze environment) for the three types of quantum agents to different classical agents. The classical tabular approach outperforms all other methods, as the underlying algorithm guarantees an optimal solution. The authors argue that there seems to be some intrinsic advantage of the Type 2 quantum circuits, as these perform better than the other approximate algorithms.

Remarks. We think there needs to be some further investigation regarding the applicability of the algorithm to actual quantum hardware. Currently, we propose to consider the approach as QiRL. We agree that QC offers great potential for complex-valued RL, as QC itself deals with complex numbers. However, there are still open questions regarding the most promising way to exploit this connection. A quantum version of a POMDP is discussed in Ref. [BBA14], which might provide an interesting extension of this paper.

Unentangled quantum reinforcement learning agents in the OpenAI Gym, Hsiao et al. (2022)

Summary. The paper by Hsiao et al. [Hsi+22] uses a hybrid proximal policy optimization (PPO) algorithm, with a combination of VQC and NN as policy function approximator. The quantum circuit architecture is atypical, as it only uses $1$-qubit rotations. Consequently, no entanglement is created, and all qubits can be considered as independent systems. Still, the resulting RL agent is able to learn good policies on some standard environments (CartPole, Acrobot, and LunarLander). The learned parameters are ported to quantum hardware and tested with satisfactory results.

Underlying RL Algorithm and Model Architecture. The classical RL algorithm is PPO, i.e. a policy-based approach. It follows the typical hybrid setup, as the VQC is used as function approximator, and parameter updates are computed on classical hardware. To enhance the expressivity of the model, a classical NN is appended. It uses the measured expectation values as inputs. The outputs of the network are post-processed using a softmax function.

The structure of the hybrid model is displayed in Fig. 14. The feature map consists of $1$-qubit rotations, which is a common choice in the literature. The variational (‘parameter’ in Fig. 14) layer incorporates $1$-qubit parameterized rotations. It is important to highlight that the circuit does not contain any multi-qubit gates. Consequently, no entanglement between the qubits is created. As efficient classical simulation of the circuit is possible, the approach should be counted towards QiRL. Despite this, the authors demonstrate that a good RL training performance can be achieved with this model.
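The efficient classical simulability follows from the fact that each qubit can be treated as an independent two-dimensional system, so the cost grows linearly instead of exponentially with the number of qubits. The sketch below illustrates this; the concrete gate sequence ($R_{x}$ encoding followed by $R_{y}$ and $R_{z}$ with trainable angles) and all names are assumptions for illustration, not the circuit of Fig. 14.

```python
import numpy as np

Z = np.diag([1.0, -1.0])


def rx(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def rz(a):
    return np.diag([np.exp(-1j * a / 2), np.exp(1j * a / 2)])


def unentangled_expectations(features, weights):
    """Return <Z> on every qubit of a circuit without multi-qubit gates.

    features: (n_qubits,) encoding angles, weights: (n_qubits, 2) trainable
    angles. Only 2x2 matrices are ever multiplied, one qubit at a time.
    """
    expectations = []
    for x, (w_y, w_z) in zip(features, weights):
        state = rz(w_z) @ ry(w_y) @ rx(x) @ np.array([1, 0], dtype=complex)
        expectations.append(float(np.real(np.vdot(state, Z @ state))))
    return np.array(expectations)
```

For this particular gate sequence, each expectation value equals $\cos(x)\cos(w_{y})$, which makes explicit that the quantum part acts as a fixed trigonometric feature transform per input dimension before the classical network.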

Experimental Results. The described hybrid agent is trained on three tasks from the OpenAI Gym, i.e. CartPole, Acrobot, and LunarLander. The quantum agents outperform several classical architectures. As this is achieved with much fewer parameters, the authors claim that the approach points towards a potential advantage.

The results on LunarLander are remarkable in that it might be the most complex environment solved with VQC-based RL thus far. While the classical simulability prohibits any intrinsic quantum advantage, the models are still able to achieve a good performance. This gives rise to the question of whether one can draw inspiration from quantum mechanics for purely classical approaches.

Testing on Quantum Hardware. Once the models are trained, they are tested with the learned parameters on IBMQ hardware (with up to $8$ qubits, depending on the environment). The models are able to replicate the learned near-optimal behavior.

Remarks. As the VQCs can be simulated classically without entanglement, we agree with the authors that the proposed algorithm should be considered as a QiRL approach. As the proposed model also incorporates a classical network, it is not clear which part of the learning is conducted with the VQC. The simple circuit structure might also explain why the results for testing on hardware are stable. Usually, a big portion of the noise is caused by two-qubit gates, which are not present in the used VQC. A partial re-implementation of this work can also be found in Ref. [Tru+23].

Compendium of Architecture Discussions

As demonstrated by the previously discussed work of Hsiao et al. [Hsi+22], it is important to put careful consideration into the design of the employed quantum model architecture. In the following, we briefly summarize several works that make contributions in that direction. The idea of incorporating the information of multiple timesteps via recurrent networks is discussed by S. Y.-C. Chen [Che23c] and extended in Ref. [Che23a]. Several explicit VQC architectures are compared and analyzed by Drăgan et al. [Dră+22]. An automated approach for architecture generation is proposed in Sun et al. [SMT23]. Different encoding techniques are discussed by Andrés et al. [ACN23]. Drawing a connection to a different context, the work by Park et al. [PPR20] proposes to vary the architecture itself by dynamically including and excluding two-qubit gates.

Recurrent Quantum Neural Networks. The work by S. Y.-C. Chen [Che23c] proposes the use of quantum recurrent neural networks (QRNNs) in the $Q$-learning setting (see Sec. 4.2.1), specifically quantum long short-term memory (QLSTM) [CYF22]. This enables the agent to also incorporate information from previous timesteps into the decision process. It is experimentally demonstrated on the CartPole environment that the QRNN is at least competitive with, if not superior to, purely classical models of similar size. It is also discussed that the method might be well suited for partially observable environments, establishing a connection to [Kim+21]. A continuation of this line of research in Ref. [Che23a] proposes a more efficient training routine for QRNN, based on reservoir computing [LJ09] and the QA3C approach discussed in Ref. [Che23].

Explicit Architecture Comparison. A study by Drăgan et al. [Dră+22] compares various circuit architectures for a modified version of the FrozenLake environment. The underlying algorithm is a quantum version of PPO (see Sec. 4.2.2) and the VQCs are combined with classical NNs to a hybrid model. The results suggest that the performance is strongly dependent on the choice of VQC architecture. Measures like expressibility [SJA19], entanglement capability [SJA19], and effective dimension [Abb+21] provide an a priori indicator for the potential suitability of the architecture. However, there seems to be no clear correlation between the concrete value of these measures and the RL performance.

Continuous Environments and Encoding. The work by Kruse et al. [Kru+23] extends the actor-critic paradigm (discussed e.g. in Ref. [Dră+22]) to continuous action spaces. The authors demonstrate that the quantum agent is able to learn in the environments Pendulum-v1 and LunarLander-v2. It is conjectured that applying an arctan function to data points, as is often done in the literature, is in fact counter-productive for the overall performance. Moreover, a stacked encoding is proposed, which uses angle encoding on multiple qubits for a single data point. This avoids pre-processing with a classical neural network, ensuring that potential performance improvements can really be attributed to the quantum agent. On both benchmarks a reduction in parameter complexity compared to classical agents is reported. However, this only holds true for certain design choices, which again highlights the importance of architecture selection.

Automatic Generation of Architectures. Sun et al. [SMT23] propose an automated tool for the generation of QRL-suitable circuit architectures. The method is based on differential quantum architecture search (DQAS) [Zha+22], i.e. the architecture itself is trained using gradient-based methods. The approach is studied within the framework of quantum $Q$ -learning (see Sec. 4.2.1) on the FrozenLake environment. Using DQAS, the authors are able to identify a VQC architecture that seems to be very well-suited for the given problem and outperforms some typically used problem-agnostic circuit designs.

Encoding Considerations. The work by Andrés et al. [ACN23] compares different strategies for encoding data into the VQC, all within the context of quantum $Q$ -learning (see Sec. 4.2.1). Evaluations are conducted on three environments within the energy-efficiency and management context. The authors compare three different architecture layouts: (1) classical data is pre-processed and reduced in dimensionality using a NN and encoded via rotational parameters; (2) similar, but data re-uploading [Pér+20] is employed; (3) classical data is normalized and encoded via amplitude encoding [SP18], and the output is post-processed with a NN. The authors claim superior performance compared to classical models of similar size, especially using amplitude encoding. However, it has to be noted that the experiments were quite small-scale. The combination with NNs complicates statements on the actual contribution of the quantum part. It also has to be noted that amplitude encoding might not be NISQ-compatible in the general case.

Variational quantum reinforcement learning via evolutionary optimization, Chen et al. (2022)

Summary. The main focus of the paper by Chen et al. [Che+22] is the investigation of gradient-free evolutionary optimization for $Q$ -learning with VQCs. This routine is tested in two different scenarios, and a state encoding scheme is proposed for each of them. More concretely, amplitude encoding is applied to the CartPole environment. For the gridworld environment MiniGrid with a larger state space ( $147$ -dimensional), the paper proposes a hybrid model with an encoding mechanism based on tensor network (TN) techniques.

Amplitude Encoding. The observation space of the CartPole environment is $4$ -dimensional. The state values are continuous. This allows the use of amplitude encoding, i.e. two qubits can be used to encode the (re-scaled) values into the four amplitudes of the system. The authors follow the method described in Schuld and Petruccione [SP18]. This works fine for small systems, but requires operations that are not NISQ-compatible for bigger instances.
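To make the dimension counting concrete, the following minimal sketch normalizes a $4$ -dimensional observation into the four amplitudes of a $2$ -qubit state. It only computes the target amplitude vector classically; it does not construct the state-preparation circuit. The function name `amplitude_encode` and the example observation are hypothetical and not taken from Ref. [Che+22].

```python
import numpy as np

def amplitude_encode(observation):
    """Return (amplitudes, n_qubits) for a real vector whose length is a power of two."""
    x = np.asarray(observation, dtype=float)
    n_qubits = x.size.bit_length() - 1
    if 2 ** n_qubits != x.size:
        raise ValueError("vector length must be a power of two")
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("the zero vector cannot be encoded")
    # Dividing by the norm yields a valid quantum state; the norm itself is discarded.
    return x / norm, n_qubits

# Hypothetical CartPole-like observation: position, velocity, angle, angular velocity.
obs = [0.02, -0.35, 0.04, 0.51]
amplitudes, n_qubits = amplitude_encode(obs)
probabilities = amplitudes ** 2  # measurement probabilities of the four basis states
print(n_qubits, amplitudes, probabilities.sum())
```

Note that in this sketch the overall scale of the observation is lost in the normalization, so any re-scaling of the raw values has to happen before this step.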

TN-based Encoding. The MiniGrid environment is similar to FrozenLake, as the goal in both environments is to navigate from a start state to a goal state along the shortest possible path. The paper uses simple environment configurations, with state spaces of size $5\times 5$ , $6\times 6$ , and $8\times 8$ . The observation space is of dimensionality $7\times 7\times 3$ . The agent has to decide between $6$ actions, of which only $4$ are relevant in the simplified scenario. The reward is defined as $1-0.9\cdot\mathrm{number\_steps}/\mathrm{max\_number\_steps}$ . Apart from the larger observation space, we assume this environment to be of about the same complexity as FrozenLake.

The paper addresses the problem of encoding the $147$ -dimensional state into a quantum feature map with just $8$ variational parameters. Other work uses e.g. CNNs to reduce the dimensionality of the feature space [LS21]. As the encoding networks have to be pre-trained, it is not quite clear what part of the work is really done by the VQC. The authors suggest using a hybrid encoding scheme based on TNs, similar to Chen et al. [Che+21]. The proposed TN technique encodes the observation $[v_{1},\cdots,v_{147}]^{t}$ into the product state $[1-v_{1},v_{1}]^{t}\otimes[1-v_{2},v_{2}]^{t}\otimes\cdots\otimes[1-v_{N},v_{N}]^{t}$ , where the individual elements are normalized. Those encoded states are represented by the red nodes in Fig. 15a. The trainable part of the matrix product state (MPS) outputs an $8$ -dimensional compressed feature vector. This is represented by the $147+1$ blue nodes and the open leg (i.e. outgoing edge) in Fig. 15a. The bond dimension is a hyperparameter of the MPS, which correlates with the number of trainable parameters [Per+06].
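The local feature map described above can be sketched as follows. The sketch assumes that the entries $v_{i}$ have been rescaled to $[0,1]$ and that each two-component vector is normalized individually; both are assumptions made for illustration. The explicit tensor product is only built for a toy example with $N=3$ , since its dimension grows as $2^{N}$ ; the trainable MPS, which avoids ever forming this vector, is not included.

```python
import numpy as np

def local_feature_map(v):
    """Map each v_i in [0, 1] to the normalized two-component vector [1 - v_i, v_i]."""
    v = np.asarray(v, dtype=float)
    phi = np.stack([1.0 - v, v], axis=-1)  # shape (N, 2)
    return phi / np.linalg.norm(phi, axis=-1, keepdims=True)

def product_state(phi):
    """Explicit tensor product of the local vectors (dimension 2**N, toy sizes only)."""
    state = np.array([1.0])
    for local in phi:
        state = np.kron(state, local)
    return state

v = np.array([0.0, 0.25, 1.0])  # hypothetical toy observation with N = 3
phi = local_feature_map(v)
psi = product_state(phi)
print(phi)
print(psi.shape, np.linalg.norm(psi))
```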

VQC Architecture. The model follows the typical three-part architecture, i.e. first the feature map, then the variational part, and finally some measurements. For the CartPole environment, a simple $2$ -qubit circuit with amplitude encoding and $4$ variational layers is used. Both qubits are measured in the Pauli- $Z$ basis and the action corresponding to the higher expectation value is selected. For the MiniGrid environment, an $8$ -qubit circuit with just one repetition of the variational layer is used. The encoding is done with the TN-compressed state, i.e. the output from the TN is encoded into the circuit as shown in Fig. 15b. As the environment has $6$ actions, the top $6$ qubits are measured, and the action corresponding to the highest expectation value is executed.

RL with Evolutionary Optimization. The underlying algorithm is a $Q$ -learning RL approach. The updates of the QNN representing the action-value function are conducted via evolutionary optimization. This implies that no gradients have to be computed. Usually, gradient computation is one major bottleneck of VQC-based RL, which might be circumvented by this approach.

The paper uses a simplistic instance of an evolutionary algorithm, where mutation, but no recombination operations are employed. An initial population of $M$ individuals is generated, which are used to simulate some episodes on the environment. The best $T$ agents (the ones producing the highest reward averaged over several runs) are selected as parents for the next generation. Random Gaussian noise is applied to these parents (mutation), until $M-1$ children are generated. Additionally, the best individual from the previous generation is kept, i.e. the population again consists of $M$ individuals. This procedure is repeated until a certain convergence criterion is met, e.g. a high enough reward.
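The mutation-only procedure described above can be sketched as follows. This is a minimal illustration, not the implementation of Ref. [Che+22]: the `fitness` callable stands in for the averaged episode reward of an agent with the given parameters, and the hyperparameter names (`population_size` for $M$ , `n_parents` for $T$ , `sigma`) as well as the toy objective are hypothetical.

```python
import numpy as np

def evolve(fitness, n_params, population_size=20, n_parents=5,
           sigma=0.1, n_generations=50, target=None, seed=0):
    """Mutation-only evolutionary search with elitism; returns (best_params, best_score)."""
    rng = np.random.default_rng(seed)
    population = [rng.normal(size=n_params) for _ in range(population_size)]
    best, best_score = None, -np.inf
    for _ in range(n_generations):
        scores = np.array([fitness(ind) for ind in population])
        order = np.argsort(scores)[::-1]           # indices sorted by descending fitness
        elite = population[order[0]]
        best, best_score = elite, float(scores[order[0]])
        if target is not None and best_score >= target:
            break                                   # convergence criterion met
        parents = [population[i] for i in order[:n_parents]]
        children = []
        for _ in range(population_size - 1):        # M - 1 mutated children
            parent = parents[rng.integers(n_parents)]
            children.append(parent + sigma * rng.normal(size=n_params))
        population = [elite] + children             # keep the best individual unchanged
    return best, best_score

# Toy stand-in for the averaged episode reward (assumption for illustration only).
def toy_fitness(params):
    return -float(np.sum((params - 1.0) ** 2))

best, score = evolve(toy_fitness, n_params=4, seed=1)
print(best, score)
```

Because the elite individual is carried over unchanged and the toy fitness is deterministic, the best score in this sketch cannot decrease from one generation to the next. With a noisy reward estimate, as in actual RL episodes, this guarantee no longer holds.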

Experimental Findings and Discussion. The paper applies the two different encoding methods, combined with the evolutionary optimization idea, to the respective environments. All experiments are conducted as noiseless simulations. On the CartPole environment, the $2$ -qubit architecture achieves a near-optimal performance with only $26$ parameters, which is significantly fewer than in most state-of-the-art NNs. The authors claim that with their method the number of parameters can be reduced to $\mathcal{O}(\mathrm{polylog}(n))$ . In contrast, classical ML requires $\mathcal{O}(\mathrm{poly}(n))$ parameters.

The experiments on the MiniGrid environment employ the described hybrid TN-based architecture. Results are compared to an encoding based on a simple NN, presumably similar to Lockwood and Si [LS21]. All approaches achieve a near-optimal performance. Overall, the TN-model (with large enough bond dimension) slightly outperforms the classical approach. The authors consider this a proof-of-principle for the effectiveness of the MPS encoding for RL.

Remarks. The amplitude encoding is currently not feasible for more complex problems, due to the lack of a NISQ-compatible state-preparation routine. The evolutionary optimization approach could circumvent some of the problems typically associated with gradient-based techniques. Experiments on larger-scale environments might be an interesting direction for future work, to investigate how the evolutionary algorithm deals with more complex optimization landscapes. We suggest incorporating some recombination procedures into the evolutionary algorithm to enhance its performance.

Multi-Objective Formulation. Related work by Ding and Spector [DS23] proposes a version of evolutionary search for the automated generation of QRL architectures (see also the discussions on VQC architecture above in Ref. [Hsi+22] and related work). The training itself is done with a QPG approach [Jer+21] (see Sec. 4.2.2) and nested with evolutionary architecture search [DS22]. This procedure is conducted w.r.t. several objectives, including enforcing an as-small-as-possible model size and several noise-related considerations. The approach is validated on the three benchmark environments CartPole, MountainCar, and Acrobot. The results demonstrate improved training behavior – with smaller model size – compared to previous work [Jer+21]. The authors also further analyze the learned architectures for recurring patterns. However, it is acknowledged that larger-scale experiments are necessary to identify a general guideline for architecture selection.

Multi-Agent Scenario. The work by Kölle et al. [Köl+23] extends the framework of Ref. [Che+22] to the multi-agent setting (see Sec. 4.2.3). The authors compare different evolutionary strategies, including mutation-only and two different setups with additional recombination steps. The evaluation is conducted on the CoinGame environment and yields results that are competitive with classical approaches – using significantly fewer parameters. It has to be noted that the experiments are too small-scale to make reliable statements about the scaling behaviour of this approach. While evolutionary optimization is certainly an interesting consideration compared to gradient-based techniques, the stated advantage regarding reduced proneness to barren plateaus is not sufficiently documented and should therefore be viewed with some scepticism.

4.2.6 Application-Focused Work

This section summarizes work that discusses VQC-based QRL techniques for specific applications. On the one hand, this is a very important area of research, in order to identify practically relevant QRL one day. On the other hand, it has to be noted that all current work is limited to relatively small problem setups. This can be justified by current hardware restrictions – but also casts some doubt on the scalability of the stated results. Nonetheless, an overview of the considered ideas might be beneficial for further research:

Applications related to robotics and similar control tasks are discussed in Refs. [Acu+22, Hei+22, Cob23, BYK22, SMK23, Hic+23, KCP23]. Planning tasks of different form are the focus of Refs. [Cor+23, San+22, ACN22, Liu+23, Kum+23, Rai+23, SH23, RKM22]. Collaborative environments are addressed with multi-agent methods in Refs. [Yan+22, Par+23, NS+23, Par+23a, PK23, Yun+23, Ans+23]. The field of finances is discussed in Refs. [Che+23b, Yan23]. A back-to-the-roots work considers QRL for board games in Ref. [CRC23]. Last but not least, the task of designing VQC architectures is addressed in Ref. [Che23d].

Citation	First Author	Title
[Acu+22]	A. Acuto	Variational Quantum Soft Actor-Critic for Robotic Arm Control
[Hei+22]	D. Heimann	Quantum Deep Reinforcement Learning for Robot Navigation Tasks
[Cob23]	J. Cobussen	Quantum Reinforcement Learning for Sensor-Assisted Robot Navigation Tasks
[BYK22]	N. F. Bar	An Approach Based on Quantum Reinforcement Learning for Navigation Problems
[SMK23]	A. Sinha	Nav-Q: Quantum Deep Reinforcement Learning for Collision-Free Navigation of Self-Driving Cars
[Hic+23]	M. L. Hickmann	Potential analysis of a Quantum RL controller in the context of autonomous driving
[KCP23]	G. S. Kim	Realizing Stabilized Landing for Computation-Limited Reusable Rockets: A Quantum Reinforcement Learning Approach
[Cor+23]	R. Correll	Quantum Neural Networks for a Supply Chain Logistics Application
[San+22]	F. Sanches	Short quantum circuits in reinforcement learning policies for the vehicle routing problem
[ACN22]	E. Andrés	On the Use of Quantum Reinforcement Learning in Energy-Efficiency Scenarios
[Liu+23]	D. Liu	Multi-agent quantum-inspired deep reinforcement learning for real-time distributed generation control of 100% renewable energy systems
[Kum+23]	M. Kumar	Blockchain Based Optimized Energy Trading for E-Mobility Using Quantum Reinforcement Learning
[Rai+23]	S. Rainjonneau	Quantum Algorithms applied to Satellite Mission Planning for Earth Observation
[SH23]	M. Shahid	Introducing Quantum Variational Circuit for Efficient Management of Common Pool Resources
[RKM22]	F. Rezazadeh	Towards Quantum-Enabled 6G Slicing

Table 25: [Part 1] Work considered for “QRL with VQCs– Application-Focused Work” (Sec. 4.2.6)

Citation	First Author	Title
[Yan+22]	R. Yan	A Multiagent Quantum Deep Reinforcement Learning Method for Distributed Frequency Control of Islanded Microgrids
[Par+23]	S. Park	Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV System
[NS+23]	B. Narottama	Layerwise Quantum Deep Reinforcement Learning for Joint Optimization of UAV Trajectory and Resource Allocation
[Par+23a]	S. Park	Quantum Multi-Agent Reinforcement Learning for Autonomous Mobility Cooperation
[PK23]	S. Park	Quantum Reinforcement Learning for Large-Scale Multi-Agent Decision-Making in Autonomous Aerial Networks
[Yun+23]	W. J. Yun	Quantum Multi-Agent Actor-Critic Neural Networks for Internet-Connected Multi-Robot Coordination in Smart Factory Management
[Ans+23]	J. A. Ansere	Quantum Deep Reinforcement Learning for Dynamic Resource Allocation in Mobile Edge Computing-based IoT Systems
[Che+23b]	E. A. Cherrat	Quantum Deep Hedging
[Yan23]	J. Yang	Apply Deep Reinforcement Learning with Quantum Computing on the Pricing of American Options
[CRC23]	J. Chao	Quantum Enhancements for AlphaZero
[Che23d]	S. Y.-C. Chen	Quantum Reinforcement Learning for Quantum Architecture Search

Table 26: [Part 2] Work considered for “QRL with VQCs– Application-Focused Work” (Sec. 4.2.6)

QRL for Robotics and other Control Tasks

The work by Acuto et al. [Acu+22] applies the quantum SAC approach proposed in Ref. [Lan21] to the control of a robotic arm. The environment is implemented as an extension of the Acrobot-v1 environment. On this small-scale setup the hybrid quantum model demonstrates reduced parameter complexity compared to classical methods.

A robot navigation scenario is discussed by Heimann et al. [Hei+22] in a simulated environment. The quantum Q-learning (see Sec. 4.2.1) approach demonstrates parameter reduction compared to classical approaches. The setup is extended to a more complex environment by J. Cobussen [Cob23].

A similar robot navigation task is considered in Bar et al. [BYK22], which employs the Q-learning method proposed in Ref. [Che+20]. The authors report a reduction in the number of parameters, which however also yields a decreased success rate for the considered scenarios.

Collision-free navigation of self-driving cars is considered in Sinha et al. [SMK23]. The authors employ an actor-critic quantum A2C approach, which is similar to the QA3C introduced by Ref. [Che23]. On a small $4$ -qubit toy environment the proposed approach shows improved training stability compared to classical A2C. A similar problem is considered with tools from quantum $Q$ -learning (see Sec. 4.2.1) by Hickmann et al. [Hic+23].

The task of steering reusable rockets is considered in Kim et al. [KCP23]. The unspecified QRL method demonstrates reduced memory requirements (by requiring fewer parameters) on an $8$ -qubit toy environment.

QRL for Planning Tasks

The vehicle routing problem (VRP) is considered by Correll et al. [Cor+23] via a quantum-enhanced attention mechanism. Several parts of a classical encoder-decoder model with attention mechanism [KVW18] are replaced with medium-scale VQCs (up to $10$ qubits). By using quantum methods to implement orthogonal NNs [KLM21], a potential speed-up during inference is reported. Experiments on a simple instance of the traveling salesman problem (TSP) are conducted to support this claim. A simpler approach for the same task is considered in Sanches et al. [San+22], where only the attention heads are replaced with $4$ -qubit VQCs.

The work by Andrés et al. [ACN22] considers different planning tasks related to energy-efficiency scenarios. The authors employ quantum actor-critic methods (see Sec. 4.2.2) to address these tasks. They report a slower convergence compared to classical methods, but in exchange a reduced parameter complexity. Similar scenarios within the energy context are also discussed by Liu et al. [Liu+23] and Kumar et al. [Kum+23].

The task of satellite mission planning is formulated as a scheduling problem and addressed by Rainjonneau et al. [Rai+23]. The authors apply two different quantum-enhanced methods within this context: (1) policy approximation (see Sec. 4.2.2) with VQCs; (2) replacing several components of AlphaZero with quantum components, similar to what is discussed in Ref. [CRC23]. The experiments with $4$ -qubit circuits demonstrate a clear improvement compared to straightforward greedy methods.

The problem of distributing common pool resources is discussed by Shahid and Hassan [SH23]. Quantum-enhanced $Q$ -learning (see Sec. 4.2.1) is applied to an $8$ -qubit toy environment, and superior training performance compared to classical models of similar size is reported.

A task from mobile communication (6G slicing) is considered in Rezazadeh et al. [RKM22]. The authors employ the VQC-based $Q$ -learning approach proposed in Ref. [Che+20] and claim improvements w.r.t. parameter complexity and the potential for distributed computing.

QRL in Collaborative Scenarios

Different tasks that are based on the collaboration of multiple entities are discussed in a series of works by Yan et al. [Yan+22], Park et al. [Par+23, Par+23a, PK23], Yun et al. [Yun+23], Narottama et al. [NS+23], and Ansere et al. [Ans+23]. The foundation is the multi-agent approach QMARL proposed in Ref. [Yun+22] with smaller extensions. On respective toy environments, the approaches demonstrate faster convergence and reduced parameter complexity compared to classical implementations.

QRL for Finances

The work by Cherrat et al. [Che+23b] addresses the task of deep hedging with distributional actor-critic methods. Classical methods are modified with quantum-enhanced orthogonal NNs [KLM21], which promises speed-ups during inference. This is supported by medium-scale hardware tests on up to $16$ qubits – which makes this one of the largest-scale demonstrations of VQC-based QRL.

Another work within the context of finances, conducted by J. Yang [Yan23], proposes the use of quantum Q-learning (see Sec. 4.2.1) to speed up calculations.

QRL for Games

The work by Chao et al. [CRC23] thinks back to the origins of classical RL and considers the board game Othello, which basically is a simplified version of Go. To solve this toy environment, the authors modify two components of AlphaZero [Sil+18]: (1) replacing function approximators with VQCs; (2) using tensor network methods for feature extraction. For simulations on up to $12$ qubits, the methods show performance comparable to classical approaches.

QRL for Architecture Design

S. Y.-C. Chen [Che23d] addresses the task of quantum circuit design. The author uses the actor-critic method QA3C [Che23] to generate circuits that prepare $2$ -qubit Bell states and GHZ states on up to $3$ qubits.

4.3 Projective Simulation for Quantum Reinforcement Learning

Citation	First Author	Title
[BD12]	H. J. Briegel	Projective simulation for artificial intelligence
[Mel+17]	A. A. Melnikov	Projective simulation with generalization
[Boy+20]	W. L. Boyajian	On the convergence of projective-simulation–based reinforcement learning in Markov decision processes
[Pap+14]	G. D. Paparo	Quantum Speedup for Active Learning Agents
[Tei21]	M. Teixeira	Quantum Reinforcement Learning Applied to Games
[TRC21]	M. Teixeira	Quantum Reinforcement Learning Applied to Board Games
[DFB15]	V. Dunjko	Quantum-enhanced deliberation of learning agents using trapped ions
[Sri+18]	T. Sriarunothai	Speeding-up the decision making of a learning agent using an ion trap quantum processor
[Fla+23]	F. Flamini	Towards interpretable quantum machine learning via single-photon quantum walks

Table 27: Work considered for “Projective Simulation for QRL” (Sec. 4.3)

Projective simulation for artificial intelligence, Briegel et al. (2012) and related work

Summary. Projective simulation for artificial intelligence by Briegel et al. [BD12] is the first in a series of articles, which propose a learning scheme for creative behavior. This is understood in the sense that the agent can deal with unseen experiences by relating to other conceivable situations. The method is developed for classical agents. There is only a brief final paragraph, outlining a quantum-mechanical implementation. Since subsequent papers ‘quantize’ the original idea heavily, a brief summary is in order: The approach is based on a random walk on a previous-experience network (memory), simulating an agent pondering its next action. More specifically, previous experiences compose a network of clips, which is dynamically modified by new experiences. It is important to note that clips, in contrast to actual experiences, are e.g. remembered observations, states or actions. To select the next action, an observation of the agent activates a clip, followed by a random walk through the network (projective simulation). This is repeated until an action is ‘excited’ and coupled out from the network and the action is selected. It is worthwhile noting that the term projective as used here is not related to its use in quantum physics, such as in projective measurement.

Action Selection. The process of action selection is slightly more sophisticated than described above. If a percept $s$ is observed, a random walk through the network starts from the corresponding percept clip. After some deliberation time the random walk reaches an action clip, which is only out-coupled and taken in reality if the percept-action pair $(s,a)$ was rewarded in the past (i.e. tagged positively). If not, a new simulation is started. This process repeats until an action clip with positive tag, or a predefined reflection time is reached; in the latter case the action is out-coupled irrespective of the tag.

Learning Procedure. The actual learning process can be summarized as follows:

1. If a transition $(s,a)$ is rewarded, increase the network weight of the direct transition $s\rightarrow a$ . (Note that the agent might have chosen $s\rightarrow a$ after many steps of PS; by reinforcing the direct transition, it might be exploited directly next time);

2. Increase the weights of the indirect transition (all weights of the network that led to the transition $s\rightarrow a$ in the random walk through the network). Thus, the agent discovers useful actions after deliberation of fictitious clips;

3. Introduce damping of all weights to let the agent forget, in order to be able to adapt to new situations (as appearing for example in a time-dependent environment);

4. If a new situation is discovered, a corresponding clip is added to the network and directed edges from all the other clips to the new one are added;

5. Additional extensions can be implemented, such as modifications of clips and creation of completely fictitious compositions of episodes.
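A heavily simplified sketch of steps 1 and 3 is given below. It assumes a two-layer clip network in which every percept clip is connected directly to every action clip, so that the random walk reduces to a single hop; intermediate clips, indirect transitions, emotion tags, reflection, and the dynamic creation of clips are all omitted. The class name, the specific damping rule (relaxing weights towards their initial value $1$ ) and the toy task are assumptions for illustration, not the exact formulation of Ref. [BD12].

```python
import numpy as np

class TwoLayerPSAgent:
    """Toy PS-style agent: percept clips connected directly to action clips."""

    def __init__(self, n_percepts, n_actions, damping=0.01, seed=0):
        self.h = np.ones((n_percepts, n_actions))  # edge weights, initially uniform
        self.damping = damping
        self.rng = np.random.default_rng(seed)

    def act(self, percept):
        # One-hop random walk: hop probability proportional to the edge weight.
        probs = self.h[percept] / self.h[percept].sum()
        return int(self.rng.choice(len(probs), p=probs))

    def learn(self, percept, action, reward):
        # Step 3: damp all weights towards their initial value (forgetting).
        self.h -= self.damping * (self.h - 1.0)
        # Step 1: reinforce the direct transition percept -> action if rewarded.
        self.h[percept, action] += reward

# Hypothetical toy task: the action is rewarded if it matches the percept index.
agent = TwoLayerPSAgent(n_percepts=2, n_actions=2)
for step in range(500):
    percept = step % 2
    action = agent.act(percept)
    reward = 1.0 if action == percept else 0.0
    agent.learn(percept, action, reward)
print(agent.h)
```

In this sketch, unrewarded edges stay at their initial weight while rewarded edges grow until reinforcement and damping balance, which is what lets the agent re-adapt if the rewarded actions change over time.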

This line of research has been continued in Ref. [Mel+17] (generalization) and [Boy+20] (convergence). In the last paragraph of Ref. [BD12] a quantum version of the algorithm is briefly discussed. The idea is to replace the random walk on the network by a quantum walk. A number of subsequent papers investigate the quantum approach more rigorously:

In order to define a quantum walk algorithm as done in Ref. [Pap+14], the PS approach is viewed slightly differently. The given clip network with the percept set $S$ is separated into $|S|$ disjoint networks. Thus one obtains a directed weighted graph (a Markov chain) for each percept with action clips as absorber states. Each of the actions is initially flagged (corresponding to the emotion tags of the initial projective simulation proposal). If an actual out-coupled action did not lead to a reward, this particular flag is removed. Now the action selection proceeds in the following way: If the agent observes a percept $s$ , a random walk starts through the graph (deliberation) until an action is reached, which is out-coupled only if the action is flagged (reflection). Thus, action selection corresponds to sampling from the conditional probability distribution over the flagged action space. Given the transition matrix $P$ of the Markov chain, subsequent applications of $P$ to the initial state (probability one for the percept clip) realize the approximate stationary distribution (subsequently referred to as diffusion). Sampling from this distribution and disregarding un-flagged actions produces the correct samples. As $p_{s}$ is the probability to sample a flagged action from the equilibrium distribution obtained by diffusion, one needs to repeat the sampling process $\mathcal{O}(1/p_{s})$ times until a flagged action is sampled. The quantum random walk search algorithm is closely related to Grover’s algorithm. By elevating the transition matrix to a diffusion operator and introducing an oracle that marks flagged actions, the quantum algorithm only needs $\mathcal{O}(1/\sqrt{p_{s}})$ oracle calls. Consequently, a quadratic speed-up for the deliberation process can be achieved. Therefore, this quantum algorithm speeds up the agent’s internal computation time for action selection. This technique is extended and applied to board games in Refs. [Tei21, TRC21].
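The classical repetition count can be checked numerically with the following sketch. It simulates only the classical reflection loop, abstracting each diffusion-and-sample step as a Bernoulli trial with success probability $p_{s}$ , and compares the empirical mean with $1/p_{s}$ . The quantum side is represented merely by the scaling $1/\sqrt{p_{s}}$ without constants; no quantum walk is simulated. All function names are hypothetical.

```python
import math
import random

def classical_repetitions(p_flagged, rng):
    """Repeat sampling until a flagged action is drawn; return the number of tries."""
    tries = 1
    while rng.random() >= p_flagged:
        tries += 1
    return tries

def mean_classical_repetitions(p_flagged, n_trials=20000, seed=0):
    """Empirical mean number of repetitions over many independent deliberations."""
    rng = random.Random(seed)
    total = sum(classical_repetitions(p_flagged, rng) for _ in range(n_trials))
    return total / n_trials

def quantum_call_scaling(p_flagged):
    """Scaling 1/sqrt(p_s) of the oracle calls; constant factors are omitted."""
    return 1.0 / math.sqrt(p_flagged)

for p in (0.2, 0.05):
    print(p, mean_classical_repetitions(p), 1.0 / p, quantum_call_scaling(p))
```

The gap between the two scalings widens as $p_{s}$ shrinks, i.e. the quadratic improvement matters most when flagged actions are rare in the stationary distribution.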

Experimental Implementation. In Ref. [DFB15] the authors investigate the implementation of the algorithm proposed by Ref. [Pap+14] on an ion trap quantum computer. Results are also backed up by numerical simulations. The actual proof-of-principle experiment with two qubits is discussed in Ref. [Sri+18], where signatures of the quadratic speed-up are observed. Ref. [Fla+23] proposes a quantum-optics based implementation of the projective simulation paradigm. Here, the random walk through the clip network is promoted to a quantum walk of a single photon through an optical interferometer. Outcoupling of an action then corresponds to an occupation number measurement of output modes.

4.4 Boltzmann Machines for Quantum Reinforcement Learning

Citation	First Author	Title
[Jer+21a]	S. Jerbi	Quantum Enhancements for Deep Reinforcement Learning in Large Spaces
[Cra+18]	D. Crawford	Reinforcement Learning Using Quantum Boltzmann Machines
[Sch+22]	M. Schenk	Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines
[Lev+17]	A. Levit	Free energy-based reinforcement learning using a quantum processor

Table 28: Work considered for “Boltzmann Machines for QRL” (Sec. 4.4)

Quantum Enhancements for Deep Reinforcement Learning in Large Spaces, Jerbi et al. (2021) and related work

Summary. The work presented in Ref. [Jer+21a] investigates an alternative NN architecture to those often used for learning the $Q$ -function (or more generally the merit function) in RL tasks. The authors argue that these alternative models perform advantageously in large action spaces. This is due to their capability to represent multimodal functions better than standard network architectures, while using a similar number of parameters. It is further found that these alternative architectures are closely related to energy-based models, some of which admit quantum representations. In turn, this allows quantum evaluations, enabling a provable quantum speed-up for fault-tolerant quantum computing.

Motivation. The standard architecture for $Q$ -learning with NNs is depicted in Fig. 16 (upper part). The representation of a state is fed into a NN, which outputs the values of the so-called merit function (the $Q$ -value in case of $Q$ -learning) for each possible action (given the state). The policy can be derived from this function by softmax post-processing. The effective-temperature parameter is decreased over time to reduce exploration and enhance exploitation.

The authors argue that this NN architecture is not suited for large action spaces. It has to output a high-dimensional function, i.e. the merit functions for all actions simultaneously for a given state. The authors argue that this network is unable to approximate a multimodal merit function in case of complex state-action correlations. Instead, the authors discuss the NN structure shown in the lower part of Fig. 16. Here, the state and action are fed to the NN, which outputs the corresponding merit function. Action selection is done by sampling from the probability distribution given by a softmax function on the values of the merit function. Therefore, sampling requires $|A|$ forward passes, where $A$ is the action set, making action selection a computationally expensive task for large action spaces.
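The cost of action selection in the second architecture can be illustrated with a minimal sketch. The function `merit` is a hypothetical stand-in for a network that takes a state-action pair and returns a scalar (here a single tanh unit), and `beta` denotes an inverse effective temperature; neither is taken from Ref. [Jer+21a].

```python
import numpy as np

def merit(state, action, weights):
    """Hypothetical stand-in for a network M(s, a): one forward pass per (s, a) pair."""
    features = np.concatenate([state, action])
    return float(np.tanh(features @ weights))

def sample_action(state, actions, weights, beta, rng):
    """Softmax sampling over merit values; needs one forward pass per action."""
    merits = np.array([merit(state, a, weights) for a in actions])  # |A| evaluations
    logits = beta * merits
    logits -= logits.max()                   # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    index = int(rng.choice(len(actions), p=probs))
    return index, probs

rng = np.random.default_rng(0)
state = np.array([0.5, -0.2])
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
weights = np.array([0.3, -0.1, 0.8, -0.5])

index, probs_explore = sample_action(state, actions, weights, beta=0.5, rng=rng)
_, probs_exploit = sample_action(state, actions, weights, beta=50.0, rng=rng)
print(index, probs_explore, probs_exploit)
```

Increasing `beta` (lowering the effective temperature) concentrates the distribution on the action with the largest merit value, while the number of forward passes per decision stays at $|A|$ regardless of the temperature.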

Experiments are conducted on a generalized GridWorld environment with a large set of actions. The associated complex transition function gives rise to one optimal and many sub-optimal policies. The authors find that the NN architecture shown in the lower part of Fig. 16 indeed performs better, but at the cost of the expensive sampling described before.

Energy-based Models. The potential for quantum speed-up comes from the observation that the second architecture in Fig. 16 is equivalent to a certain kind of energy-based model. Energy-based function approximators are used for generative modeling of probability distributions based on the Boltzmann-Gibbs distribution with respect to an energy functional. Boltzmann machines are one instance of such energy-based models where the energy functional is given by a spin-spin interaction model. However, Boltzmann machines are hard to train, which led to the development of restricted Boltzmann machines where a special interaction structure with a hidden layer enables more efficient training. In Ref. [Jer+21a] the authors observe that the lower architecture in Fig. 16 is equivalent to a generalized form of restricted Boltzmann machines.

Quantum Speed-Up. Inspired by this insight, the authors next investigate quantum energy-based models. Here, the classical spin-spin interaction energy is promoted to a spin Hamiltonian, known as quantum Boltzmann machines and restricted quantum Boltzmann machines. Some of these models allow efficient training, while the hardness of sampling remains. To speed up sampling in the classical and quantum setting, the following quantum subroutines for an RL algorithm are discussed:

(1) Quantum Gibbs sampling: The Gibbs-Boltzmann distribution is prepared as a qsample, from which expectation values can be sampled with quadratic speed-up, compared to classical Monte-Carlo sampling methods. (2) Gibbs-state preparation by Hamiltonian simulation: Using Hamiltonian-simulation techniques, an approximation to the Gibbs qsample can be prepared, leading to quadratic speed-up compared to exact sampling (calculating all energies and explicitly normalizing the probability distribution). (3) Quantum simulated annealing: This method uses a quantum method for the approximate Monte-Carlo sampling of the Gibbs state itself by leveraging quantum random walks on graphs.

All methods discussed so far need oracularized access to the Hamiltonian and it is unlikely that they could be realized on current hardware. A realization on near-term hardware might be achieved by (4) Variational Gibbs-state preparation: Here, a variational circuit can be employed to approximate a Gibbs qsample, using the free energy as an objective. Any quantum speed-up for this method, however, is heuristic and has not been made rigorous so far.

Remarks. Related work [Cra+18, Sch+22, Lev+17] proposes models based on quantum Boltzmann machines for quantum annealing hardware. Since this literature survey focuses on algorithms proposed for gate-based QC, we do not include a detailed summary here.

4.5 Quantum Policy and Value Iteration

So far, we have considered QRL algorithms that employ QC for function approximation or propose quantum approaches to alternative learning frameworks such as PS. We now turn to proposals that replace subroutines of existing RL frameworks by quantum algorithms such as amplitude estimation, quantum maximum finding, and quantum matrix inversion. As a result, the proposed QRL algorithms guarantee improved sample or computational complexity. As these methods need oracular access to the environment, they should be categorized as post-NISQ algorithms.

Citation	First Author	Title
[Wan+21]	D. Wang	Quantum algorithms for reinforcement learning with a generative model
[Gan+23]	B. Ganguly	Quantum Computing Provides Exponential Regret Improvement in Episodic Reinforcement Learning
[Zho+23]	H. Zhong	Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret
[GA23]	B. Ganguly	Quantum Acceleration of Infinite Horizon Average-Reward Reinforcement Learning
[CKP23]	E. A. Cherrat	Quantum Reinforcement Learning via Policy Iteration
[Wie+22]	S. Wiedemann	Quantum Policy Iteration via Amplitude Estimation and Grover Search - Towards Quantum Advantage for Reinforcement Learning

Table 29: Work considered for “Quantum Policy and Value Iteration” (Sec. 4.5)

Quantum algorithms for reinforcement learning with a generative model, Wang et al. (2021)

Summary. The work in Ref. [Wan+21] proposes two algorithms for RL with a generative model and rigorously derives bounds for their sample complexity.

Classical Generative Models. Classically, the term generative model describes a simulator which, when queried with a state-action pair $(s,a)$ , produces a sample $s^{\prime}\sim P(\cdot|s,a)$ . Thus, by repeated sampling for each state-action pair, one can estimate the transition matrix of the underlying MDP. This allows one to subsequently obtain an approximation of the optimal policy by means of value iteration. Over the years there has been tremendous effort devoted to improving sample efficiency (defined as the number of times the simulator has to be queried). This performance metric is meaningful if one assumes that every query of the simulator is costly. The best classical algorithm [Li+20a] requires a total number of $\mathcal{O}(|S||A|\Gamma^{3}/\epsilon^{2})$ samples, where $|S|$ and $|A|$ are the number of states and actions, $\Gamma=1/(1-\gamma)$ is the effective horizon of the MDP, and $\epsilon$ is the deviation of the optimal value function from the approximation. The sample complexity is linear in the product $|S||A|$ , since the transition matrix has to be estimated for each $(s,a)$ . The factor $1/\epsilon^{2}$ originates from Hoeffding’s inequality (indeed, bounding the deviation of a sample average from its true value by $\epsilon$ requires $\mathcal{O}(1/\epsilon^{2})$ samples). The origin of the third power of $\Gamma$ , in contrast, is less intuitive. Note that the sample complexity of the classical algorithm matches the classical lower bound and is therefore optimal.
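The naive model-based procedure described above (estimate the transition matrix by repeated queries, then run value iteration) can be sketched in Python as follows. This is a minimal illustration and not the optimal algorithm of Ref. [Li+20a]; the toy MDP, the helper names `estimate_model`, `value_iteration`, and `sample_next_state`, and the number of samples per pair are hypothetical choices.

```python
import numpy as np

def estimate_model(sample_next_state, n_states, n_actions, n_samples, rng):
    """Estimate P(s'|s,a) by querying the simulator n_samples times per (s, a)."""
    P_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    return P_hat / n_samples

def value_iteration(P, R, gamma, tol=1e-10):
    """Iterate the Bellman optimality operator until the update falls below tol."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V            # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.9
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))   # rewards assumed to be known

def sample_next_state(s, a, rng):
    """Toy generative model: one query returns one sample s' ~ P(.|s,a)."""
    return rng.choice(n_states, p=P_true[s, a])

n_samples = 2000
P_hat = estimate_model(sample_next_state, n_states, n_actions, n_samples, rng)
V_true, _ = value_iteration(P_true, R, gamma)
V_hat, policy_hat = value_iteration(P_hat, R, gamma)

total_queries = n_states * n_actions * n_samples   # linear in |S||A|
error = np.max(np.abs(V_hat - V_true))
print(total_queries, error)
```

The total number of simulator queries is the quantity referred to as sample complexity; in this sketch it is simply $|S||A|$ times the number of samples per pair.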

Incorporating Quantum Subroutines. As shown in Ref. [Wan+21], the classical sample complexity can be reduced by replacing the classical mean-estimation subroutine in Ref. [Li+20a] by a quantum routine based on the quantum mean-estimation algorithm [Bra+02]. Even though the optimal classical algorithm is more sophisticated than outlined above, and so is its quantization, the following discussion captures the essential features. The quantum subroutine requires the generative model in oracle form and can then be used to estimate the expectation value $\mathbb{E}(V)=\sum_{s^{\prime}}P(s^{\prime}|s,a)V(s^{\prime})$ (which appears in the Bellman equation) individually for every pair $(s,a)$ in time $\mathcal{O}(1/\epsilon)$ . This quadratic speed-up originates from Grover’s algorithm, on which the quantum mean-estimation algorithm is based. As a consequence, the quantum-policy iteration algorithm achieves the sample complexity $\mathcal{O}(|S||A|\Gamma^{1.5}/\epsilon)$ with a quadratic improvement in $\Gamma$ and $\epsilon$ .

The dependence on the size of the action space can be further reduced by using quantum maximum finding [Mon15] to calculate the maximum over actions in the Bellman optimality equation. However, using this quantum routine, one cannot fully exploit the power of the classical optimal algorithm. Hence, while the dependence on $|A|$ is reduced quadratically and the improvement in $\epsilon$ is kept, the improvement in $\Gamma$ is lost. As a result, the algorithm based on both quantum mean estimation and quantum maximum finding achieves a sample complexity $\mathcal{O}(|S|\sqrt{|A|}\Gamma^{3}/\epsilon)$ .

Finally, a lower bound $\Omega(|S||A|\Gamma^{1.5}/\epsilon)$ is derived and possible improvements of the algorithm to reach this limit are discussed.
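To get a feeling for the scalings quoted in this summary, the following Python sketch evaluates the three leading-order expressions for an example instance. Constants and logarithmic factors are dropped, so the numbers only indicate relative scaling; the function names and the chosen parameter values are our own and purely illustrative.

```python
import math

def classical_samples(n_states, n_actions, gamma, eps):
    """Leading-order classical sample complexity |S||A| Gamma^3 / eps^2."""
    Gamma = 1.0 / (1.0 - gamma)
    return n_states * n_actions * Gamma**3 / eps**2

def quantum_mean_estimation_samples(n_states, n_actions, gamma, eps):
    """Leading order with quantum mean estimation: |S||A| Gamma^1.5 / eps."""
    Gamma = 1.0 / (1.0 - gamma)
    return n_states * n_actions * Gamma**1.5 / eps

def quantum_max_finding_samples(n_states, n_actions, gamma, eps):
    """Leading order with additional maximum finding: |S| sqrt(|A|) Gamma^3 / eps."""
    Gamma = 1.0 / (1.0 - gamma)
    return n_states * math.sqrt(n_actions) * Gamma**3 / eps

example = dict(n_states=100, n_actions=16, gamma=0.9, eps=0.1)
c = classical_samples(**example)
q_mean = quantum_mean_estimation_samples(**example)
q_max = quantum_max_finding_samples(**example)
print(f"classical: {c:.3g}, mean estimation: {q_mean:.3g}, max finding: {q_max:.3g}")
```

For this particular instance the variant with maximum finding needs more queries than the variant with mean estimation only, which reflects the lost improvement in $\Gamma$ ; the ordering changes when $|A|$ is large compared to $\Gamma^{3}$ .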

Quantum computing provides exponential regret improvement in episodic reinforcement learning, Ganguly et al. (2023)

Summary. In Ref. [Gan+23] and independently in Ref. [Zho+23] the authors consider the problem of an agent operating in a finite-horizon episodic tabular MDP and investigate if quantum computing can alleviate the exploration-exploitation trade-off. This problem has been considered for the case of bandits [Wan+23, LHT22, LZ22] but is here generalized to the full multi-state RL problem. In the online setting, the agent only has access to the next state and reward given its current state and chosen action. In contrast, previous work [Wan+21] assumed access to a generative model, which can be queried with arbitrary state-action pairs producing samples of the next state and reward. This setting does not consider the exploration-exploitation trade-off that arises from online interaction with the environment. Here, the agent must learn to discover high-reward states by a suitable exploration strategy. The performance of the agent in this problem can be measured by the regret, which is defined as the cumulative difference between the optimal value function and its approximation after $K$ episodes. The goal is to design an algorithm with the weakest scaling of the regret in $K$ , indicating a more effective trade-off between exploration and exploitation. The classical UCB-VI algorithm achieves the lower bound $\Omega(\sqrt{K})$ [JOA10, AOM17] of the regret. The proposed quantum algorithm in Ref. [Gan+23] builds upon this classical algorithm by replacing the mean estimation routine with a quantum algorithm. Given a state-action pair, the quantum algorithm assumes a ‘transition oracle’ which generates a quantum superposition over all possible next states with amplitudes given by the square root of the respective transition probabilities. A similar oracle is used for generating rewards. The algorithm utilizes the quantum multivariate mean estimation algorithm [Ham21], which reduces the number of samples required to satisfy a given error bound for mean estimation quadratically. 
The result is a decrease of the regret of the quantum algorithm from $\mathcal{O}(\sqrt{K})$ to $\mathcal{O}(1)$ up to logarithmic factors. This is an exponential improvement over classical results. In a follow-up work by the same authors [GA23], the results were extended to infinite-horizon problems, where an exponential reduction in regret from $\mathcal{O}(\sqrt{T})$ to $\mathcal{O}(1)$ ( $T$ being the total number of time steps) is achieved. Additionally, Ref. [Zho+23] considers linear function approximation and demonstrates that the exponential improvement is maintained.

Quantum Reinforcement Learning via Policy Iteration, Cherrat et al. (2023)

Summary. Ref. [CKP23] proposes a quantum algorithm for an iterative scheme of $Q$ -value evaluation and policy improvement. The algorithm evaluates the $Q$ -values on a quantum computer, where they are encoded in a state vector and subsequently extracted by measurements. Afterwards, the policy is improved on a classical device. The algorithm can achieve quantum advantage in certain situations.

To set up the general framework, the authors first formulate the Bellman equation for $Q$ -value evaluation as a matrix equation [LP03]

Q=R+\gamma P\Pi Q\,.

Denoting the size of the action and state space as $|A|$ and $|S|$ , the $|A||S|$ -dimensional vectors $Q$ and $R$ represent the $Q$ -values and the reward vector, respectively; the environment transition function is the $|A||S|\times|S|$ -dimensional matrix $P$ ; the policy is represented by an $|S|\times|A||S|$ -dimensional matrix $\Pi$ ; $\gamma$ denotes the usual discounting factor. The authors propose to compute $(1\!\!1-\gamma P\Pi)^{-1}R$ on a quantum device.
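For small problems, the matrix form of the Bellman equation can be solved directly on a classical computer, which may help to fix the shapes of the objects involved. The following Python sketch is a classical reference computation only and is not the quantum routine of Ref. [CKP23]; the random toy MDP, the row ordering of state-action pairs, and the helper `policy_matrix` are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9
n_sa = n_states * n_actions   # assumed ordering: index of (s, a) is s * n_actions + a

P = rng.dirichlet(np.ones(n_states), size=n_sa)   # shape (|S||A|, |S|)
R = rng.uniform(size=n_sa)                        # reward vector of length |S||A|

def policy_matrix(policy_probs):
    """Build Pi of shape (|S|, |S||A|) from pi(a|s) given as an (|S|, |A|) array."""
    n_s, n_a = policy_probs.shape
    Pi = np.zeros((n_s, n_s * n_a))
    for s in range(n_s):
        Pi[s, s * n_a:(s + 1) * n_a] = policy_probs[s]
    return Pi

pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy
Pi = policy_matrix(pi)

# Policy evaluation: Q = (1 - gamma * P * Pi)^{-1} R
Q = np.linalg.solve(np.eye(n_sa) - gamma * P @ Pi, R)

# Check that Q satisfies the Bellman equation Q = R + gamma * P * Pi * Q.
residual = np.max(np.abs(Q - (R + gamma * P @ Pi @ Q)))
print(Q.reshape(n_states, n_actions), residual)
```

The linear system is well-posed because $P\Pi$ is a stochastic matrix and $\gamma<1$ , so that $1\!\!1-\gamma P\Pi$ is invertible.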

Quantum Subroutine: Block Encodings and Linear Algebra. To perform this task, Ref. [CKP23] relies on so-called block encodings of matrices [Gil+19]. This powerful framework gives rise to various quantum algorithms for encoding general complex (not necessarily square) matrices in the leading principal block of a larger unitary matrix. Once the data has been loaded, the framework further provides linear-algebra routines such as matrix multiplication, addition [Gil+19] and inversion [CKS17]. The encoding algorithms need quantum access to the data, i.e. via oracles. Therefore, the methods can be attributed to the post-NISQ algorithms category. A well-known data-loading scheme is the sparse-input model, viable for sparse matrices. The authors of Ref. [CKP23] apply a more general scheme, the so-called $\mu_{p}(A)$ [CGJ19] block encoding of a matrix $A$ . Here, the quality (i.e. the probability to obtain the correct output of the algorithm, e.g. after matrix-vector multiplication and a subsequent measurement) of the encoding depends on the maximum of the column and row norms of the matrix. The aforementioned norm is a function of a parameter $p$ , which can be chosen freely to optimize the encoding quality. Based on this formalism, the authors show that policy evaluation requires time

\mathcal{O}(\mu_{P}\Gamma\mathrm{polylog}(|S||A|\Gamma/\epsilon))\,.

(49)

In Eq. 49, $\Gamma=(1-\gamma)^{-1}$ and $\epsilon$ denotes the accuracy of the matrix inversion subroutine. The term $\mu_{P}$ describes the quality of the encoding of the environment-transition matrix, which depends on the structure of the environment. In the worst case it scales as $\sqrt{|S||A|}$ . Due to the sparsity of the transition function of many environments, a better scaling is often expected. As discussed below, for the frozen-lake environment one even finds $\mu_{P}=\mathcal{O}(1)$ . The complexity in Eq. 49 assumes an efficient loading routine for the matrices. To achieve efficient loading also for the policy matrix, a QRAM data structure for the policy needs to be constructed. This needs to happen in time $\mathcal{O}(|S||A|)$ for each policy-evaluation step. Afterwards, the matrix can be loaded efficiently for each cycle of the measurement protocol.

Classical Subroutine: Policy Improvement. The policy improvement step on a classical device requires reading out the $Q$ -vector from the quantum computer after matrix inversion. Naively, one would expect that the measurement process introduces exponential overhead. However, since convergence results for the Bellman equations are based on the maximum norm ( $L_{\infty}$ norm), the authors employ $L_{\infty}$ -norm state tomography [KP20]. This is efficient, i.e. requires $\mathcal{O}(1/\epsilon^{2})$ shots, where $\epsilon$ now is the target accuracy for the optimal $Q$ -values (under $L_{\infty}$ -norm). Consequently, the overall time complexity (neglecting logarithmic terms) of the algorithm is

\mathcal{O}(|S||A|+\mu_{P}\Gamma/\epsilon^{2})\,.

(50)

In Eq. 50 the factor $1/\epsilon^{2}$ appears in the second term since the matrix inversion subroutine is called for each of the $1/\epsilon^{2}$ shots. The first term is the classical complexity of calculating the $\arg\max$ function for policy improvement and construction of the policy oracle prior to each evaluation step.

Example Environments. The authors consider the FrozenLake and the InvertedPendulum environments as examples. We will briefly discuss the insights from the former here: The simple form of the environment allows choosing $\mu_{P}=1/2$ , which thus is independent of the size of the action and state space. Note that the gate complexity is still of the order of $|S||A|$ . It only becomes efficient for specially structured instances of the environment, such as instances with all ‘holes’ on the diagonal of the grid.

Quantum advantage. The leading term in Eq. 50 is linear in $|S|$ and $|A|$ , showing a speed-up with respect to classical solvers for linear systems of equations, which exhibit complexity $\mathcal{O}((|S||A|)^{\omega})$ with $\omega>1$ , and with respect to vanilla $Q$ -value iteration, which has complexity $\mathcal{O}(|S|^{2}|A|)$ . Even though a more detailed characterization of possible quantum advantage is not provided in Ref. [CKP23], it is clear that the speed-up can be at most polynomial.

Least-Squares Policy Iteration. Finally, the authors generalize the method to least-squares policy iteration [LP03], where the $Q$ -vector is approximated by a set of basis functions. For details refer to Refs. [LP03, CKP23].

Quantum Policy Iteration via Amplitude Estimation and Grover Search - Towards Quantum Advantage for Reinforcement Learning, Wiedemann et al. (2022)

Summary. In the QRL scheme proposed in Refs. [Wie+22, Wie21], a policy is evaluated by constructing a superposition of all possible trajectories of an MDP with fixed-horizon and with finite action and state space. Making use of amplitude estimation [Bra+02], the number of calls to a state-transition oracle for estimation of the value function (up to some fixed additive error) can be quadratically reduced. A second algorithm finds the optimal policy in the policy space quadratically faster compared to direct policy search by means of Grover’s algorithm.

First Algorithm. The first algorithm assumes access to a policy oracle $\Pi$ and an environment oracle $E$ which act on an initial state $|s\rangle$ as

\Pi(|s\rangle|0\rangle_{\mathcal{A}})=\sum_{a}\sqrt{\pi(a|s)}|s\rangle|a\rangle

E(|s\rangle|a\rangle|0\rangle_{\mathcal{R}}|0\rangle_{\mathcal{S}})=\sum_{r,s^{\prime}}\sqrt{p(r,s^{\prime}|s,a)}|s\rangle|a\rangle|r\rangle|s^{\prime}\rangle\,.

Applying these operators sequentially on partially fresh registers as shown in Fig. 17 results in a superposition of all possible trajectories

|t\rangle=|s_{0}\rangle|a_{0}\rangle|r_{1}\rangle|s_{1}\rangle\cdots|r_{H}\rangle|s_{H}\rangle

where $H$ is the horizon of the MDP, such that the quantum state reads

|\psi^{\pi}\rangle=\sum_{t}\sqrt{p_{t}}|t\rangle|G_{t}\rangle\,.

Here, $p_{t}$ is the probability of trajectory $t$ . An additional unitary operator has been applied that calculates the return $G_{t}$ of trajectory $t$ and encodes the value into an additional register entangled with the corresponding trajectory. The superscript $\pi$ on $|\psi^{\pi}\rangle$ denotes that the state corresponds to the superposition of trajectories for a given policy $\pi$ .

The next step of the algorithm attaches an ancilla qubit. With bit-by-bit rotations of the state, the digital encoding of $G_{t}$ is transformed into amplitude encoding (assuming here for simplicity $G_{t}\in[0,1]$ ). A simple calculation reveals that the probability of finding the ancilla qubit in state $|1\rangle$ is given by the average return, that is, the value function of the initial state $s$ . With this insight in mind, the authors propose to use amplitude estimation [Bra+02], which involves the phase-estimation algorithm, to extract the value function. While classically sampling from the superposition of trajectories would require $\mathcal{O}(1/\epsilon^{2})$ preparations of the state, the quantum algorithm achieves the same error with $\mathcal{O}(1/\epsilon)$ preparations, resulting in a quadratic speed-up. Here, $\epsilon$ denotes the fixed additive error to which the value function is to be determined.
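The ‘simple calculation’ mentioned above can be checked numerically for a toy MDP. The following Python sketch enumerates all trajectories classically, builds the amplitudes $\sqrt{p_{t}}$ , applies the ancilla rotation branch by branch, and compares the resulting probability of $|1\rangle$ with the value function obtained from a backward recursion. It is a classical emulation under simplifying assumptions of our own (a deterministic reward function $r(s,a,s^{\prime})\in[0,1]$ , an undiscounted return divided by the horizon, randomly drawn dynamics) and does not implement amplitude estimation itself.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, horizon = 2, 2, 3
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
reward = rng.uniform(size=(n_states, n_actions, n_states))         # r(s, a, s') in [0, 1]
s0 = 0

amplitudes, returns = [], []
for actions in itertools.product(range(n_actions), repeat=horizon):
    for states in itertools.product(range(n_states), repeat=horizon):
        s, p_t, g_t = s0, 1.0, 0.0
        for a, s_next in zip(actions, states):
            p_t *= pi[s, a] * P[s, a, s_next]
            g_t += reward[s, a, s_next] / horizon   # normalized so that G_t lies in [0, 1]
            s = s_next
        amplitudes.append(np.sqrt(p_t))
        returns.append(g_t)
amplitudes, returns = np.array(amplitudes), np.array(returns)

# Ancilla rotation on each branch:
# sqrt(p_t)|t> -> sqrt(p_t)|t>(sqrt(1 - G_t)|0> + sqrt(G_t)|1>)
amp_one = amplitudes * np.sqrt(returns)
prob_one = float(np.sum(amp_one**2))

# Independent check: finite-horizon policy evaluation by backward recursion.
V = np.zeros(n_states)
for _ in range(horizon):
    V = np.einsum("sa,sax,sax->s", pi, P, reward / horizon + V[None, None, :])
value_dp = float(V[s0])
print(prob_one, value_dp)
```

Estimating `prob_one` by repeatedly measuring the ancilla corresponds to classical Monte-Carlo sampling with $\mathcal{O}(1/\epsilon^{2})$ state preparations, which is the step that amplitude estimation improves upon.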

Second Algorithm. The second algorithm shown in Ref. [Wie+22] is a quantum version of direct policy search. The authors propose to create a superposition

\frac{1}{\sqrt{|P|}}\sum_{\pi}|\pi\rangle|\psi^{\pi}\rangle

where $|\pi\rangle$ is a digital representation of the policy, $|\psi^{\pi}\rangle$ the superposition of all trajectories corresponding to policy $\pi$ as before, and $|P|$ the size of the policy space. Quantum minimum finding [DH96] can now be applied to find the optimal policy (the one with maximal expected return starting from initial state $s$ ), requiring only $\mathcal{O}(\sqrt{|P|})$ preparations of the state. This compares to $\mathcal{O}(|P|)$ preparations in classical direct policy search. Note, however, that the space of all policies scales as $\mathcal{O}(|A|^{|S|})$ , where $|A|$ and $|S|$ are the sizes of action and state space, respectively. Consequently, the quantum algorithm scales exponentially worse compared to policy iteration, where the Bellman optimality equation is iterated with polynomial complexity in $|S|$ and $|A|$ . The method proposed in Refs. [Wie+22, Wie21] therefore should be seen as a quantum version of direct policy search.
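The scaling argument can be illustrated with a few lines of Python that compare the size of the deterministic policy space, its square root, and a polynomial reference cost. The numbers are leading-order counts without constants; the function name `direct_search_evaluations` and the use of $|S|^{2}|A|$ (the cost of one sweep of vanilla $Q$ -value iteration) as a polynomial reference are our own illustrative choices.

```python
import math

def direct_search_evaluations(n_states, n_actions):
    """Return (|P|, floor(sqrt(|P|))) for deterministic policies, |P| = |A|^|S|."""
    n_policies = n_actions ** n_states
    return n_policies, math.isqrt(n_policies)

n_actions = 4
for n_states in (4, 16, 64):
    classical, quantum = direct_search_evaluations(n_states, n_actions)
    sweep_cost = n_states**2 * n_actions   # polynomial reference cost
    print(f"|S|={n_states:3d}  |P|={classical:.3e}  sqrt|P|={quantum:.3e}  "
          f"|S|^2|A|={sweep_cost}")
```

Even after the quadratic reduction, the number of state preparations grows exponentially with $|S|$ , whereas the polynomial reference cost stays moderate.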

4.6 Quantum Reinforcement Learning with Oracularized Environments

In this final section we summarize work that proposes fully quantum-mechanical approaches to QRL. In the articles we survey below, the environment is a quantum system or oracle that can be queried by superpositions of states and actions. Interactions with a quantum-mechanical agent create superpositions of trajectories as input for subroutines like Grover search, quantum-maximum finding, and amplitude estimation. Provable quantum advantage renders some of these proposals interesting candidates for the post-NISQ era.

Citation	First Author	Title
[DTB16]	V. Dunjko	Quantum-Enhanced Machine Learning
[DTB15]	V. Dunjko	Framework for learning agents in quantum environments
[DTB17]	V. Dunjko	Advances in quantum reinforcement learning
[HDW21]	A. Hamann	Quantum-accessible reinforcement learning beyond strictly epochal environments
[Wan+21a]	D. Wang	Quantum exploration algorithms for multi-armed bandits
[Wan+23]	Z. Wan	Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets
[Sag+21]	V. Saggio	Experimental quantum speed-up in reinforcement learning agents
[HW22]	A. Hamann	Performance analysis of a hybrid agent for quantum-accessible reinforcement learning
[Cor18]	A. Cornelissen	Quantum gradient estimation and its application to quantum reinforcement learning

Table 30: Work considered for “QRL with Oracularized Environments” (Sec. 4.6)

Quantum-Enhanced Machine Learning, Dunjko et al. (2016) and related work

Summary. In Ref. [DTB16] and in a more detailed preprint [DTB15] a general framework of an agent-environment interaction where both entities are quantum-mechanical systems is developed. To query the environment by a superposition of action states (intuitively the agent learns in parallel), clearly the environment must be modeled by some form of an oracle. As it turns out, this oracularization is much more involved than one might naively think. The focus of the work is therefore:

• Formalizing a quantum mechanical version of agent-environment interaction
• Investigation of the classical limit
• Properties of the general quantum mechanical set-up
• Treatment of special oracularizable environments
• Identification of quantum advantage for these environments

General Setup. The interaction between agent and environment is modeled as shown in Fig. 2a in Ref. [DTB15]. The register $R_{A}$ processes the computations of the agent, while the register $R_{E}$ represents the environment. The communication register $R_{C}$ stores one action and one state. The interaction is described by completely positive trace preserving (CPTP) maps or, if we wish, unitary maps on a larger system. The first map $M_{1}^{E}$ outputs the initial state and stores it into the communication register. The map $M_{1}^{A}$ (modeling the agent) reads this state and, after some processing on $R_{A}$ , outputs an action state which is added to the communication register. Now this action is processed by $M_{2}^{E}$ , which outputs a new state. Subsequently, the previous state in the communication register is overwritten, and so on. The particular form of the states of $R_{C}$ (whether they are superpositions of actions or not) will be discussed later.

While $R_{C}$ only contains a state-action pair, the agent’s register stores all previous states and actions (because the next action proposed by the learning algorithm depends on all actions and states encountered before; note here the distinction between algorithm and policy). The same is true for the (in general non-Markovian) environment.

Next, as shown in Fig. 18, a tester register $R_{T}$ is introduced, which is designed to ‘observe’ the elapsed history (all encountered states and actions during a learning sequence). This copying from $R_{C}$ to $R_{T}$ is modeled by controlled unitaries (so they do not modify $R_{C}$ ). Each of them acts on a fresh part of the register $R_{T}$ .

The term copying the register here means that a superposition of computational basis states is concatenated with a second register, onto which each basis state is then copied. This produces in general a highly entangled state, which cannot be factorized into the initial state on the first register and a copy on the second (note that the no-cloning theorem only rules out a transformation producing this factorized copy for a general initial state). The most general form of the tester interaction treated in this work allows additional unitary transformations, such that the copying can be described in the form of controlled unitaries. A tester interaction that merely copies the states will be referred to as classical.
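The difference between this notion of copying and a factorized copy can be made explicit for a single qubit. In the following Python sketch, a CNOT gate copies computational basis states from the first to the second register; applied to a superposition, it yields an entangled state rather than two independent copies. The single-qubit setting and all variable names are our own illustrative choices.

```python
import numpy as np

plus = np.array([1.0, 1.0]) / np.sqrt(2.0)   # (|0> + |1>)/sqrt(2) on the first register
zero = np.array([1.0, 0.0])                  # fresh second register in |0>

# CNOT with the first qubit as control: |x>|0> -> |x>|x> for basis states x.
cnot = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

copied = cnot @ np.kron(plus, zero)          # (|00> + |11>)/sqrt(2)
factorized_copy = np.kron(plus, plus)        # what a hypothetical cloner would output

# The Schmidt rank (number of nonzero singular values of the amplitude matrix)
# is 1 for product states and larger than 1 for entangled states.
schmidt_values = np.linalg.svd(copied.reshape(2, 2), compute_uv=False)
schmidt_rank = int(np.sum(schmidt_values > 1e-12))
print(copied, schmidt_rank)
```

A Schmidt rank of two confirms that the ‘copied’ state is entangled and differs from the factorized copy excluded by the no-cloning theorem.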

After training, the register $R_{T}$ contains the sequence of actions and states, the so-called history. Any metric measuring performance of learning can be phrased as a function of the history probabilities. Therefore, it can be formulated as the expectation value of an observable on $R_{T}$ .

Classical Limit. For recovering the classical learning set-up, the notion of classical interaction is defined by restricting the form of the maps, such that the state in $R_{A}-R_{C}-R_{E}$ remains separable (note that the absence of entanglement between the registers does not prohibit entangled agent or environment states; thus, quantum mechanical environments and agents equipped with a quantum computer are not excluded). Additionally, the tester interaction is supposed to be classical (in the sense defined above). For this setup it is shown that for every scenario with separable register state there exists a classical environment and a classical agent that produce the same history. Consequently, no quantum improvement in the figure of merit is possible, even when the agent has access to a quantum computer.

General Quantum-Mechanical Set-Up. What happens when we allow general maps and general states on the registers? The authors prove that the state on $R_{T}$ is still an incoherent mixture, and therefore no quantum advantage can be expected. The reason for this result lies in the memory of the agent and possibly the environment: The agent in general has to remember all previously encountered states and actions, because the learning algorithm run by the agent is a function of that particular elapsed history. The quantum state therefore is a superposition of histories, each entangled with a state which describes the agent that has seen this particular history. The states of this agent are orthogonal, since a different agent state translates into a different bit state of the memory. Thus, when tracing out these degrees of freedom, the resulting reduced density matrix on $R_{T}$ is an incoherent mixture and no quantum advantage can be achieved in the figure of merit. (Side remark: This does not exclude a quantum advantage in terms of computational complexity in the internal processing of the agent. The result is about exploiting the ‘quantumness’ of the environment-agent interaction.)

We note that one has to be careful with the interpretation of density matrices. One might be inclined to think that an incoherent mixture of history states weighted by their probability in some sense corresponds to traversing all of the histories simultaneously. Note, however, that the correct expectation value with respect to this density matrix is only obtained in the limit of infinitely many runs, corresponding to sampling trajectories one after another.
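The decoherence mechanism described in the two preceding paragraphs can be reproduced with a small numerical example. The following Python sketch compares the reduced state on the tester register for an agent whose memory is perfectly correlated with the history and for a hypothetical memoryless agent. The three history probabilities and all variable names are illustrative assumptions and not taken from Refs. [DTB15, DTB16].

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # illustrative history probabilities
d = len(p)

# Agent with memory: |psi> = sum_h sqrt(p_h) |h>_T |m_h>_A with orthonormal |m_h>.
# The amplitude matrix psi[h, m] is therefore diagonal.
psi_memory = np.zeros((d, d))
psi_memory[np.arange(d), np.arange(d)] = np.sqrt(p)

# Memoryless agent: the agent register factorizes from the histories.
agent_state = np.zeros(d)
agent_state[0] = 1.0
psi_memoryless = np.outer(np.sqrt(p), agent_state)

def reduced_state_on_tester(psi):
    """Trace out the agent index m of the amplitude matrix psi[h, m]."""
    return psi @ psi.conj().T

rho_memory = reduced_state_on_tester(psi_memory)
rho_memoryless = reduced_state_on_tester(psi_memoryless)

purity_memory = float(np.trace(rho_memory @ rho_memory))
purity_memoryless = float(np.trace(rho_memoryless @ rho_memoryless))
print(rho_memory, purity_memory)
print(rho_memoryless, purity_memoryless)
```

With memory, the reduced density matrix is diagonal with entries $p_{h}$ and has purity $\sum_{h}p_{h}^{2}<1$ ; without memory, the off-diagonal coherences survive and the state on the tester register remains pure.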

Oracularization of Environments. The next part of the work focuses on a special class of environments and learning settings without memory, which overcome the decoherence problem. These oracularized environments are of the following form:

• episodic with fixed horizon $\rightarrow$ fixed sequence of interactions
• deterministic $\rightarrow$ action sequence fully determines the history, states can be disregarded
• binary rewards issued at final state $\rightarrow$ allows use of Grover search

Quantum Advantage. With these assumptions a proper oracle can be constructed that can be queried with a superposition of actions. This allows one to use it as a phase-flip oracle, as in the Deutsch-Jozsa or Grover algorithm. The time required for finding a rewarded action sequence is therefore quadratically reduced. Consequently, this setting is meaningful for learning tasks where the reward is very sparse, that is, where the agent cannot learn until it has first seen a reward. After this initial exploration phase, the agent can be further trained in simulation. Finally, some of the assumptions are relaxed. The authors also show how stochastic oracles can be constructed.
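The quadratic reduction can be illustrated with a classical state-vector simulation of Grover search over all action sequences of a deterministic, fixed-horizon environment with a single rewarded sequence. The following Python sketch is a generic textbook Grover iteration and not the specific construction of Refs. [DTB15, DTB16]; the problem size, the index of the rewarded sequence, and the function name `grover_search` are hypothetical.

```python
import numpy as np

def grover_search(n_items, marked, n_iterations):
    """Simulate Grover iterations and return the final measurement probabilities."""
    state = np.full(n_items, 1.0 / np.sqrt(n_items))   # uniform superposition
    for _ in range(n_iterations):
        state[marked] *= -1.0                  # phase-flip oracle on the rewarded item
        state = 2.0 * state.mean() - state     # diffusion: inversion about the mean
    return state**2

n_actions, horizon = 4, 3
n_sequences = n_actions**horizon   # 64 possible action sequences
rewarded = 37                      # index of the single rewarded sequence (arbitrary)

n_iterations = int(np.floor(np.pi / 4.0 * np.sqrt(n_sequences)))
probs = grover_search(n_sequences, rewarded, n_iterations)
print(n_iterations, probs[rewarded])
```

A classical agent that tries sequences one by one needs on the order of `n_sequences` attempts to find the rewarded sequence, while the number of oracle calls in the simulation grows only with its square root.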

Further Work. There is further work that builds upon the results of Refs. [DTB15, DTB16]. In Ref. [DTB17], the algorithm is applied to the optimization of parameters describing the properties of the agent (hyperparameters). It also discusses the notion of register hijacking, where the agent has access to hidden memory registers of the environment. This assumption allows the oracularization of more general environments, which is also discussed in Ref. [Dun+18]. This class is further generalized in Ref. [HDW21] beyond episodic environments. A closer investigation of amplitude amplification techniques for the special case of multi-armed bandit environments is conducted in Refs. [Wan+21a, Wan+23]. In Ref. [Sag+21], the learning setting is implemented experimentally for a two-qubit system and an experimental quantum advantage is observed. Finally, the performance of an agent in this setting is investigated in Ref. [HW22].

Quantum gradient estimation and its application to quantum reinforcement learning, Cornelissen (2018)

Summary. The master’s thesis [Cor18] considers model-based RL and develops quantum algorithms for policy evaluation and policy optimization. For the former method a quadratic improvement in sample complexity is found.

Quantum Policy Evaluation. A quantum algorithm for policy evaluation is presented in Sec. 6.2 of the thesis and will be summarized in the following: The algorithm is executed on a register that is capable of storing $T$ states and actions of a $T$ -step MDP. To generate a sequence, a transition-probability oracle and a policy oracle are defined. They generate a superposition of all possible action-state sequences of the Markov problem, weighted by the square root of the corresponding probabilities. Note that the state is normalized as the probabilities sum up to one. Next, a reward oracle is defined which, when acting on a state-action pair, multiplies the state with a phase factor. The phase is the discounted reward for this state-action pair. The discount factors are introduced by making use of fractional phase oracles. This is discussed in detail in Sec. 4 and 5 of the thesis, which are based on Refs. [GAW19, Gil+19]. The fractional reward oracle is applied to every state-action pair in the register, resulting in the product of phase factors containing the individual discounted rewards. Thus, when merging the exponentials into one exponential, the full quantum state is a superposition of all state-action sequences, weighted by the square root of the individual probability and a phase factor containing the corresponding return. Next, it is shown how the phase factor can be encoded in the amplitude by a controlled operation on an ancilla qubit. Consequently, the probability of measuring the ancilla in, say, state $|0\rangle$ is given by the expectation value of the return, that is, the value function. It can be measured using quantum amplitude estimation, which is based on the phase-estimation algorithm. The amplitude-estimation algorithm is a Grover-type algorithm. Hence, it is not surprising that the quadratic speed-up compared to classical Monte-Carlo sampling results from this algorithmic step.

Quantum Policy Optimization. In Sec. 6.4 of the thesis a policy optimization algorithm is developed. This method can be seen as a quantum analogue of policy gradient. First of all, the policy needs to be parameterized. This is done by introducing the parameters $x_{sa}$ such that $\pi(a|s)=x_{sa}$ for all $a$ but one arbitrarily chosen $a^{*}$ and $\pi(a^{*}|s)=1-\sum_{a\neq a^{*}}x_{sa}$ otherwise. By that definition, the policy is properly normalized and all $x_{sa}\in[0,1]$ . Consequently, the expected return is a high-dimensional polynomial in the parameters $x_{sa}$ . For taking the derivative of this objective, Jordan’s quantum gradient algorithm [Jor05] in its advanced form [GAW19] is employed. This leads to a finite-difference approximation of the gradients, written in a phase factor, which can be read out after applying phase estimation. Following Ref. [Gil+19], a significant amount of work is devoted to transforming the probability oracle for the policy and the transition matrix described above into a phase oracle. Once the superposition of state-action sequences is prepared, an oracle call multiplies each state in the superposition by the corresponding discounted reward. Subsequently, the gradient estimation algorithm is applied and the gradients can be read out. This step is followed by adapting the policy through gradient ascent. It is concluded in the thesis that this policy optimization algorithm does not necessarily lead to quantum speed-up. However, as the author argues, it is conceivable that improvement of the algorithm might lead to a quantum speed-up.
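A purely classical analogue of this procedure may help to clarify the parameterization and the role of the finite-difference gradient. The following Python sketch builds the policy from the parameters $x_{sa}$ , evaluates the expected discounted return of a $T$ -step toy MDP by backward recursion, approximates the gradient by central finite differences, and performs one gradient-ascent step. It does not implement Jordan's quantum gradient algorithm; the random MDP, the choice $a^{*}=0$ , the learning rate, and all function names are our own assumptions.

```python
import numpy as np

def policy_from_params(x, a_star=0):
    """pi(a|s) = x[s, a] for a != a_star and pi(a_star|s) = 1 - sum of the others."""
    n_s, n_a = x.shape[0], x.shape[1] + 1
    others = [a for a in range(n_a) if a != a_star]
    pi = np.empty((n_s, n_a))
    pi[:, others] = x
    pi[:, a_star] = 1.0 - x.sum(axis=1)
    return pi

def expected_return(x, P, R, gamma, horizon, s0=0):
    """Expected discounted return of the parameterized policy over `horizon` steps."""
    pi = policy_from_params(x)
    V = np.zeros(P.shape[0])
    for _ in range(horizon):
        Q = R + gamma * P @ V
        V = np.sum(pi * Q, axis=1)
    return V[s0]

def finite_difference_gradient(f, x, step=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        shift = np.zeros_like(x)
        shift[idx] = step
        grad[idx] = (f(x + shift) - f(x - shift)) / (2.0 * step)
    return grad

rng = np.random.default_rng(4)
n_states, n_actions, gamma, horizon = 3, 3, 0.9, 5
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))

x = np.full((n_states, n_actions - 1), 1.0 / n_actions)   # start from the uniform policy
objective = lambda params: expected_return(params, P, R, gamma, horizon)

grad = finite_difference_gradient(objective, x)
x_new = x + 0.01 * grad   # one small ascent step; in general a projection onto
                          # valid probabilities would be needed after each step
print(objective(x), objective(x_new))
```

Each classical gradient evaluation in this sketch requires two evaluations of the objective per parameter, whereas the quantum gradient algorithm aims to obtain all components of the finite-difference gradient from phase estimation.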

5 Outlook

We have given a rather detailed account of the various instances of QRL that have appeared throughout the literature. We observed that the dichotomy found at the hardware level, i.e., currently available NISQ devices vs. fault-tolerant and error-corrected QPUs, also manifests at the algorithmic level.

With NISQ devices in mind, VQCs have been suggested as function approximators. These replace their classical counterparts in RL algorithms with function approximation in policy space, value space, or both. Here, one typically replaces a classical learning heuristic by a learning heuristic with a quantum component. Any sort of potential quantum advantage, however, is not immediately apparent. We eventually can obtain theoretical insight into the properties of VQCs viewed as ML models and function approximators. However, a direct comparison to their classical cousins, such as neural networks, is anything but easy and might strongly depend on the chosen metric. How can we meaningfully deploy an agent trained with VQC-components? What are the requirements for quantum advantage in such a heuristic setting? What does non-simulability of quantum circuits imply for e.g. generalization bounds of VQCs as ML models? Can we scale VQCs while maintaining their desirable properties? What is the intrinsic inductive bias of VQCs viewed as ML models? What are the implications for RL and its application domains? All these questions are currently being investigated in the research community, and we are looking forward to new results.

While quantum algorithms for fault-tolerant and error-corrected QPUs have been put forward, we are still far from being able to deploy these algorithms for meaningful problem sizes. Given the necessary advancements of hardware platforms, it will be exciting to see whether these types of quantum algorithms will become competitive with classical learning approaches in practice.

We hope that our survey on the QRL literature and the various types of QRL algorithms will help guide newcomers to the field and will serve as a valuable reference for researchers.

Acknowledgments

We acknowledge collaboration and exchange with M. Franz, L. Wolf, M. Schönberger and W. Mauerer as well as M. J. Hartmann on the topic of quantum reinforcement learning. We further acknowledge exchange and discussion with W. Hauptmann, D. Hein, S. Udluft, V. Tresp, Y. Ma, A. Auer, M. Weber, B. Bisgin, L. Bleiziffer, C. Mendl, S. Wiedemann, S. Wölk, J. M. Lorenz, M. Monnet, T.-A. Dragan, G. Kruse and G. Kontes. We would like to thank M. Leib for feedback on an early version of the manuscript. This work was supported by the German Federal Ministry of Education and Research (BMBF), funding program “quantum technologies – from basic research to market”, grant number 13N15645.

Acronyms

BCQ: batch-constrained deep $Q$ -learning
BCQQ: batch-constrained quantum $Q$ -learning
CNN: convolutional neural network
CPTP: completely positive trace preserving
CQ2L: conservative quantum $Q$ -learning
CQL: conservative $Q$ -learning
CTDE: centralized training with decentralized execution
DDQL: double deep $Q$ -learning
DL: deep learning
DLP: discrete logarithm problem
DNN: deep neural network
DQAS: differentiable quantum architecture search
DQL: deep $Q$ -learning
DQN: deep $Q$ -network
DRL: deep reinforcement learning
DRU: data re-uploading
FIM: Fisher information matrix
MARL: multi-agent reinforcement learning
MDP: Markov decision process
ML: machine learning
MPS: matrix product state
MSE: mean square error
NISQ: noisy intermediate-scale quantum
NN: neural network
PDF: probability density function
POMDP: partially observable Markov decision process
PPO: proximal policy optimization
PS: projective simulation
QA3C: quantum asynchronous advantage actor critic
QC: quantum computing
QCNN: quantum convolutional neural network
QDDPG: quantum deep deterministic policy gradient
QiRL: quantum-inspired reinforcement learning
QLSTM: quantum long short-term memory
QMARL: quantum multi-agent reinforcement learning
QML: quantum machine learning
QNN: quantum neural network
QNPG: quantum natural policy gradient
QPG: quantum policy gradient
QPU: quantum processing unit
QRL: quantum reinforcement learning
QRNN: quantum recurrent neural network
RL: reinforcement learning
SAC: soft actor-critic
TD: temporal difference
TN: tensor network
TSP: traveling salesman problem
VQA: variational quantum algorithm
VQC: variational quantum circuit
VQ-DQN: variational quantum deep $Q$ -networks
VQE: variational quantum eigensolver
VRP: vehicle routing problem

References

[Abb+21] Amira Abbas et al. “The power of quantum neural networks” In Nat. Comput. Sci. 1.6, 2021, pp. 403–409 DOI: 10.1038/s43588-021-00084-1
[ACN22] Eva Andrés, Manuel Pegalajar Cuéllar and Gabriel Navarro “On the use of quantum reinforcement learning in energy-efficiency scenarios” In Energies 15.16, 2022, pp. 6034 DOI: 10.3390/en15166034
[ACN23] Eva Andrés, MP Cuellar and G Navarro “Efficient Dimensionality Reduction Strategies for Quantum Reinforcement Learning” In IEEE Access 11, 2023, pp. 104534–104553 DOI: 10.1109/ACCESS.2023.3318173
[Acu+22] Alberto Acuto et al. “Variational quantum soft actor-critic for robotic arm control” In arXiv:2212.11681, 2022 DOI: 10.48550/arXiv.2212.11681
[AHF20] Ramin Ayanzadeh, Milton Halem and Tim Finin “Reinforcement Quantum Annealing: A Hybrid Quantum Learning Automata” In Sci. Rep. 10.1, 2020, pp. 1–11 DOI: 10.1038/s41598-020-64078-1
[Alb+18] Francisco Albarrán-Arriagada, Juan C Retamal, Enrique Solano and Lucas Lamata “Measurement-based adaptation protocol with quantum reinforcement learning” In Phys. Rev. A 98.4, 2018, pp. 042315 DOI: 10.1103/PhysRevA.98.042315
[Alb+20] Francisco Albarrán-Arriagada, Juan Carlos Retamal, Enrique Solano and Lucas Lamata “Reinforcement learning for semi-autonomous approximate quantum eigensolver” In Mach. Learn.: Sci. Technol. 1.1, 2020, pp. 015002 DOI: 10.1088/2632-2153/ab43b4
[Alv+16] Unai Alvarez-Rodriguez, Mikel Sanz, Lucas Lamata and Enrique Solano “Artificial Life in Quantum Technologies” In Sci. Rep. 6.1, 2016, pp. 1–9 DOI: 10.1038/srep20956
[Alv+18] Unai Alvarez-Rodriguez, Mikel Sanz, Lucas Lamata and Enrique Solano “Quantum Artificial Life in an IBM Quantum Computer” In Sci. Rep. 8.1, 2018, pp. 1–9 DOI: 10.1038/s41598-018-33125-3
[Ama98] Shun-Ichi Amari “Natural Gradient Works Efficiently in Learning” In Neural Comput. 10.2, 1998, pp. 251–276 DOI: 10.1162/089976698300017746
[Amb+19] Andris Ambainis et al. “Quantum Speedups for Exponential-Time Dynamic Programming Algorithms” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 1783–1793 DOI: 10.1137/1.9781611975482.107
[Ans+23] James Adu Ansere et al. “Quantum Deep Reinforcement Learning for Dynamic Resource Allocation in Mobile Edge Computing-based IoT Systems” In IEEE Trans. Wirel. Commun., 2023 DOI: 10.1109/TWC.2023.3330868
[AOM17] Mohammad Gheshlaghi Azar, Ian Osband and Rémi Munos “Minimax Regret Bounds for Reinforcement Learning” In Proceedings of the 34th International Conference on Machine Learning 70, 2017, pp. 263–272 URL: https://dl.acm.org/doi/10.5555/3305381.3305409
[Aru+17] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage and Anil Anthony Bharath “Deep reinforcement learning: A brief survey” In IEEE Signal Process. Mag. 34.6, 2017, pp. 26–38 DOI: 10.1109/MSP.2017.2743240
[Aru+19] Frank Arute et al. “Quantum Supremacy using a Programmable Superconducting Processor” In Nature 574, 2019, pp. 505–510 DOI: 10.1038/s41586-019-1666-5
[BAQ23] BAQIS Quafu Group “Quafu-RL: The Cloud Quantum Computers based Quantum Reinforcement Learning” In arXiv:2305.17966, 2023 DOI: 10.48550/arXiv.2305.17966
[BBA14] Jennifer Barry, Daniel T. Barry and Scott Aaronson “Quantum partially observable Markov decision processes” In Phys. Rev. A 90.3, 2014, pp. 032311 DOI: 10.1103/PhysRevA.90.032311
[BD12] Hans J. Briegel and Gemma De las Cuevas “Projective simulation for artificial intelligence” In Sci. Rep. 2.1, 2012, pp. 1–16 DOI: 10.1038/srep00400
[Bel+20] Dmitrii Beloborodov et al. “Reinforcement learning enhanced quantum-inspired algorithm for combinatorial optimization” In Mach. Learn.: Sci. Technol. 2.2, 2020, pp. 025009 DOI: 10.1088/2632-2153/abc328
[Bel57] Richard Bellman “A Markovian decision process” In J. Math. Mech. 6.5, 1957, pp. 679–684 URL: https://www.jstor.org/stable/24900506
[Ben+20] Marcello Benedetti, Erika Lloyd, Stefan Sack and Mattia Fiorentini “Parameterized quantum circuits as machine learning models” In Quantum Sci. Technol. 5, 2020, pp. 019601 DOI: 10.1088/2058-9565/ab4eb5
[Ben80] Paul Benioff “The computer as a physical system: A microscopic quantum mechanical Hamiltonian model of computers as represented by Turing machines” In J. Stat. Phys. 22, 1980, pp. 563–591 DOI: 10.1007/BF01011339
[Bha+19] Kishor Bharti, Tobias Haug, Vlatko Vedral and Leong-Chuan Kwek “How to Teach AI to Play Bell Non-Local Games: Reinforcement Learning” In arXiv:1912.10783, 2019 DOI: 10.48550/arXiv.1912.10783
[BKS23] Simon Buchholz, Jonas M Kübler and Bernhard Schölkopf “Multi armed bandits and quantum channel oracles” In arXiv:2301.08544, 2023 DOI: 10.48550/arXiv.2301.08544
[BLT23] Shrigyan Brahmachari, Josep Lumbreras and Marco Tomamichel “Quantum contextual bandits and recommender systems for quantum data” In arXiv:2301.13524, 2023 DOI: 10.48550/arXiv.2301.13524
[Boy+20] Walter L. Boyajian et al. “On the convergence of projective-simulation–based reinforcement learning in Markov decision processes” In Quantum Mach. Intell. 2.13, 2020, pp. 1–21 DOI: 10.1007/s42484-020-00023-9
[Bra+02] Gilles Brassard, Peter Hoyer, Michele Mosca and Alain Tapp “Quantum amplitude amplification and estimation” In Contemp. Math. 305, 2002, pp. 53–74 URL: http://www.ams.org/books/conm/305/
[BYK22] Niyazi Furkan Bar, Hasan Yetis and Mehmet Karakose “An Approach Based on Quantum Reinforcement Learning for Navigation Problems” In 2022 International Conference on Data Analytics for Business and Industry (ICDABI), 2022, pp. 593–597 DOI: 10.1109/ICDABI56818.2022.10041570
[Cár+18] Francisco A Cárdenas-López, Lucas Lamata, Juan Carlos Retamal and Enrique Solano “Multiqubit and multilevel quantum reinforcement learning with quantum technologies” In PLoS ONE 13.7, 2018, pp. e0200455 DOI: 10.1371/journal.pone.0200455
[CCC23] Hao-Yuan Chen, Yen-Jui Chang and Ching-Ray Chang “Deep-Q Learning with Hybrid Quantum Neural Network on Solving Maze Problems” In arXiv:2304.10159, 2023 DOI: 10.48550/arXiv.2304.10159
[CCL19] Iris Cong, Soonwon Choi and Mikhail D. Lukin “Quantum convolutional neural networks” In Nat. Phys. 15.12, 2019, pp. 1273–1278 DOI: 10.1038/s41567-019-0648-8
[CD08] Chun-Lin Chen and Daoyi Dong “Superposition-Inspired Reinforcement Learning and Quantum Reinforcement Learning” In Reinforcement Learning, 2008 DOI: 10.5772/5275
[CD10] Chunlin Chen and Daoyi Dong “Complexity analysis of quantum reinforcement learning” In Proceedings of the 29th Chinese Control Conference, 2010, pp. 5897–5901 URL: https://ieeexplore.ieee.org/abstract/document/5572589
[CDC06] Chun-Lin Chen, Daoyi Dong and Zonghai Chen “Quantum computation for action selection using reinforcement learning” In Int. J. Quantum Inf. 4.06, 2006, pp. 1071–1083 DOI: 10.1142/S0219749906002419
[Cer+21] Marco Cerezo et al. “Variational quantum algorithms” In Nat. Rev. Phys. 3.9, 2021, pp. 625–644 DOI: 10.1038/s42254-021-00348-9
[CFD12] Chen Chunlin, Jiang Frank and Dong Daoyi “Hybrid control of uncertain quantum systems via fuzzy estimation and quantum reinforcement learning” In Proceedings of the 31st Chinese Control Conference, 2012, pp. 7177–7182 URL: https://ieeexplore.ieee.org/abstract/document/6391208
[CGJ19] Shantanav Chakraborty, András Gilyén and Stacey Jeffery “The Power of Block-Encoded Matrix Powers: Improved Regression Techniques via Faster Hamiltonian Simulation” In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019) 132, 2019, pp. 33:1–33:14 DOI: 10.4230/LIPIcs.ICALP.2019.33
[Che+06] Chunlin Chen, Daoyi Dong, Yu Dong and Qiong Shi “A quantum reinforcement learning method for repeated game theory” In 2006 International Conference on Computational Intelligence and Security 1, 2006, pp. 68–72 DOI: 10.1109/ICCIAS.2006.294092
[Che+19] Chih-Chieh Chen, Shiue-Yuan Shiau, Ming-Feng Wu and Yuh-Renn Wu “Hybrid classical-quantum linear solver using Noisy Intermediate-Scale Quantum machines” In Sci. Rep. 9.1, 2019, pp. 1–12 DOI: 10.1038/s41598-019-52275-6
[Che+20] Samuel Yen-Chi Chen et al. “Variational Quantum Circuits for Deep Reinforcement Learning” In IEEE Access 8, 2020, pp. 141007–141024 DOI: 10.1109/access.2020.3010470
[Che+21] Samuel Yen-Chi Chen, Chih-Min Huang, Chia-Wei Hsing and Ying-Jer Kao “An end-to-end trainable hybrid classical-quantum classifier” In Mach. Learn.: Sci. Technol. 2.4, 2021, pp. 045021 DOI: 10.1088/2632-2153/ac104d
[Che+22] Samuel Yen-Chi Chen et al. “Variational quantum reinforcement learning via evolutionary optimization” In Mach. Learn.: Sci. Technol. 3.1, 2022, pp. 015025 DOI: 10.1088/2632-2153/ac4559
[Che+23] Zhihao Cheng, Kaining Zhang, Li Shen and Dacheng Tao “Offline quantum reinforcement learning in a conservative manner” In Proceedings of the AAAI Conference on Artificial Intelligence 37.6, 2023, pp. 7148–7156 DOI: 10.1609/aaai.v37i6.25872
[Che+23a] Zhihao Cheng, Kaining Zhang, Li Shen and Dacheng Tao “Quantum Imitation Learning” In IEEE Trans. Neural Netw. Learn. Syst., 2023, pp. 1–15 DOI: 10.1109/TNNLS.2023.3275075
[Che+23b] El Amine Cherrat et al. “Quantum Deep Hedging” In Quantum 7, 2023, pp. 1191 DOI: 10.22331/q-2023-11-29-1191
[Che10] Ran Cheng “Quantum Geometric Tensor (Fubini-Study Metric) in Simple Quantum System: A pedagogical Introduction” In arXiv:1012.1337, 2010 DOI: 10.48550/arXiv.1012.1337
[Che23] Samuel Yen-Chi Chen “Asynchronous training of quantum reinforcement learning” In arXiv:2301.05096, 2023 DOI: 10.48550/arXiv.2301.05096
[Che23a] Samuel Yen-Chi Chen “Efficient quantum recurrent reinforcement learning via quantum reservoir computing” In arXiv:2309.07339, 2023 DOI: 10.48550/arXiv.2309.07339
[Che23b] Samuel Yen-Chi Chen “Quantum Deep Q-Learning with Distributed Prioritized Experience Replay” In IEEE International Conference on Quantum Computing and Engineering (QCE) 2, 2023, pp. 31–35 DOI: 10.1109/QCE57702.2023.10180
[Che23c] Samuel Yen-Chi Chen “Quantum deep recurrent reinforcement learning” In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5 DOI: 10.1109/ICASSP49357.2023.10096981
[Che23d] Samuel Yen-Chi Chen “Quantum Reinforcement Learning for Quantum Architecture Search” In Proceedings of the 2023 International Workshop on Quantum Classical Cooperative, 2023, pp. 17–20 DOI: 10.1145/3588983.3596692
[Cho+23] Byung** Cho, Yu Xiao, Pan Hui and Daoyi Dong “Quantum bandit with amplitude amplification exploration in an adversarial environment” In IEEE Transactions on Knowledge and Data Engineering, 2023 DOI: 10.1109/TKDE.2023.3279207
[CKP23] El Amine Cherrat, Iordanis Kerenidis and Anupam Prakash “Quantum reinforcement learning via policy iteration” In Quantum Mach. Intell. 5.2, 2023, pp. 30 DOI: 10.1007/s42484-023-00116-1
[CKS17] Andrew M. Childs, Robin Kothari and Rolando D. Somma “Quantum Algorithm for Systems of Linear Equations with Exponentially Improved Dependence on Precision” In SIAM J. Comput. 46.6, 2017, pp. 1920–1950 DOI: 10.1137/16M1087072
[Cob23] Joyce G.H. Cobussen “Quantum Reinforcement Learning for Sensor-Assisted Robot Navigation Tasks”, 2023 URL: https://lup.lub.lu.se/student-papers/search/publication/9141398
[Cor+23] Randall Correll et al. “Quantum Neural Networks for a Supply Chain Logistics Application” In Adv. Quantum Technol. 6.7, 2023, pp. 2200183 DOI: 10.1002/qute.202200183
[Cor18] Arjan Cornelissen “Quantum gradient estimation and its application to quantum reinforcement learning”, 2018 URL: https://repository.tudelft.nl/islandora/object/uuid:26fe945f-f02e-4ef7-bdcb-0a2369eb867e
[Cra+18] Daniel Crawford et al. “Reinforcement Learning Using Quantum Boltzmann Machines” In Quantum Inf. Comput. 18.1–2, 2018, pp. 51–74 URL: https://www.rintonpress.com/journals/doi/QIC18.1-2-3.html
[CRC23] James Chao, Ramiro Rodriguez and Sean Crowe “Quantum Enhancements for AlphaZero” In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, 2023, pp. 2179–2186 DOI: 10.1145/3583133.3596302
[Cro19] Gavin E. Crooks “Gradients of parameterized quantum gates using the parameter-shift rule and gate decomposition” In arXiv:1905.13311, 2019 DOI: 10.48550/arXiv.1905.13311
[ÇY23] Ercan Çağlar and İhsan Yilmaz “Secure Communication Based On Key Generation With Quantum Reinforcement Learning” In Int. J. Inf. Secur. 12.2, 2023, pp. 22–41 DOI: 10.55859/ijiss.1264169
[CYF22] Samuel Yen-Chi Chen, Shinjae Yoo and Yao-Lung L Fang “Quantum long short-term memory” In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 8622–8626 IEEE
[Dal+20] Mogens Dalgaard, Felix Motzoi, Jens Jakob Sørensen and Jacob Sherson “Global optimization of quantum dynamics with AlphaZero deep exploration” In NPJ Quantum Inf. 6.1, 2020, pp. 1–9 DOI: 10.1038/s41534-019-0241-0
[Dal+22] Nicola Dalla Pozza, Lorenzo Buffoni, Stefano Martina and Filippo Caruso “Quantum reinforcement learning: the maze problem” In Quantum Mach. Intell. 4.1, 2022, pp. 1–10 DOI: 10.1007/s42484-022-00068-y
[DFB15] Vedran Dunjko, Nicolai Friis and Hans J Briegel “Quantum-enhanced deliberation of learning agents using trapped ions” In New J. Phys. 17.2, 2015, pp. 023006 DOI: 10.1088/1367-2630/17/2/023006
[DH96] Christoph Durr and Peter Hoyer “A quantum algorithm for finding the minimum” In arXiv:quant-ph/9607014, 1996
[DJ92] D. Deutsch and R. Jozsa “Rapid Solution of Problems by Quantum Computation” In Proc. R. Soc. Lond. 439.1907, 1992 DOI: 10.1098/rspa.1992.0167
[Don+06] Daoyi Dong, Chun-Lin Chen, Zonghai Chen and Chen-Bin Zhang “Quantum mechanics helps in learning for more intelligent robots” In Chinese Phys. Lett. 23.7, 2006, pp. 1691 DOI: 10.1088/0256-307X/23/7/010
[Don+06a] Daoyi Dong, Chunlin Chen, Chenbin Zhang and Zonghai Chen “Quantum robot: structure, algorithms and applications” In Robotica 24.4, 2006, pp. 513–521 DOI: 10.1017/S0263574705002596
[Don+08] Daoyi Dong et al. “Incoherent control of quantum systems with wavefunction-controllable subspaces via quantum reinforcement learning” In IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 38.4, 2008, pp. 957–962 DOI: 10.1109/TSMCB.2008.926603
[Don+08a] Daoyi Dong, Chun-Lin Chen, Han-Xiong Li and Tzyh-Jong Tarn “Quantum reinforcement learning” In IEEE Trans. Syst. Man Cybern., Part B (Cybernetics) 38.5, 2008, pp. 1207–1220 DOI: 10.1109/TSMCB.2008.925743
[Don+12] Daoyi Dong, Chun-Lin Chen, Jian Chu and Tzyh-Jong Tarn “Robust Quantum-Inspired Reinforcement Learning for Robot Navigation” In IEEE/ASME Trans Mechatron 17.1, 2012, pp. 86–97 DOI: 10.1109/TMECH.2010.2090896
[Dră+22] Theodora-Augustina Drăgan, Maureen Monnet, Christian B Mendl and Jeanette Miriam Lorenz “Quantum Reinforcement Learning for Solving a Stochastic Frozen Lake Environment and the Impact of Quantum Architecture Choices” In arXiv:2212.07932, 2022 DOI: 10.48550/arXiv.2212.07932
[DS22] Li Ding and Lee Spector “Evolutionary quantum architecture search for parametrized quantum circuits” In Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2022, pp. 2190–2195 DOI: 10.1145/3520304.3534012
[DS23] Li Ding and Lee Spector “Multi-Objective Evolutionary Architecture Search for Parameterized Quantum Circuits” In Entropy 25.1, 2023, pp. 93–105 DOI: 10.3390/e25010093
[DTB15] Vedran Dunjko, Jacob M Taylor and Hans J Briegel “Framework for learning agents in quantum environments” In arXiv:1507.08482, 2015 URL: https://arxiv.org/abs/1507.08482
[DTB16] Vedran Dunjko, Jacob M. Taylor and Hans J. Briegel “Quantum-Enhanced Machine Learning” In Phys. Rev. Lett. 117.13, 2016, pp. 130501 DOI: 10.1103/PhysRevLett.117.130501
[DTB17] Vedran Dunjko, Jacob M Taylor and Hans J Briegel “Advances in quantum reinforcement learning” In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2017, pp. 282–287 DOI: 10.1109/SMC.2017.8122616
[Dun+18] Vedran Dunjko, Yi-Kai Liu, Xingyao Wu and Jacob M Taylor “Exponential improvements for quantum-accessible reinforcement learning” In arXiv:1710.11160, 2018 DOI: 10.48550/arXiv.1710.11160
[EGW05] Damien Ernst, Pierre Geurts and Louis Wehenkel “Tree-based batch mode reinforcement learning” In J. Mach. Learn. Res. 6, 2005, pp. 503–556 URL: http://jmlr.org/papers/v6/ernst05a.html
[Fak+13] Pegah Fakhari, Karthikeyan Rajagopal, SN Balakrishnan and JR Busemeyer “Quantum inspired reinforcement learning in changing environment” In New Math. Nat. Comput. 9.03, 2013, pp. 273–294 DOI: 10.1142/S1793005713400073
[Fey82] Richard P. Feynman “Simulating physics with computers” In Int. J. Theor. Phys. 21.6/7, 1982, pp. 467–488 DOI: 10.1007/BF02650179
[FH23] Jesús Fernández-Villaverde and Isaiah J Hull “Dynamic Programming on a Quantum Annealer: Solving the RBC Model”, 2023 DOI: 10.3386/w31326
[Fla+20] Fulvio Flamini et al. “Photonic architecture for reinforcement learning” In New J. Phys. 22.4, 2020, pp. 045002 DOI: 10.1109/PN50013.2020.9166962
[Fla+23] Fulvio Flamini et al. “Reinforcement learning and decision making via single-photon quantum walks” In arXiv:2301.13669, 2023 DOI: 10.48550/arXiv.2301.13669
[Fös+18] Thomas Fösel, Petru Tighineanu, Talitha Weiss and Florian Marquardt “Reinforcement Learning with Neural Networks for Quantum Feedback” In Phys. Rev. X 8.3, 2018, pp. 031084 DOI: 10.1103/PhysRevX.8.031084
[FP+23] Getahun Fikadu Tilaye and Amit Pandey “Investigating the effects of hyperparameters in quantum-enhanced deep reinforcement learning” In Quantum Eng. 2023, 2023 DOI: 10.1155/2023/2451990
[Fra+22] Maja Franz et al. “Uncovering instabilities in variational-quantum deep Q-networks” In J. Franklin Inst., 2022 DOI: 10.1016/j.jfranklin.2022.08.021
[Fuj+19] Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh and Joelle Pineau “Benchmarking batch deep reinforcement learning algorithms” In arXiv:1910.01708, 2019 DOI: 10.48550/arXiv.1910.01708
[GA23] Bhargav Ganguly and Vaneet Aggarwal “Quantum Acceleration of Infinite Horizon Average-Reward Reinforcement Learning” In arXiv:2310.11684, 2023 DOI: 10.48550/arXiv.2310.11684
[Gan+23] Bhargav Ganguly, Yulian Wu, Di Wang and Vaneet Aggarwal “Quantum Computing Provides Exponential Regret Improvement in Episodic Reinforcement Learning” In arXiv:2302.08617, 2023 DOI: 10.48550/arXiv.2302.08617
[GAW19] András Gilyén, Srinivasan Arunachalam and Nathan Wiebe “Optimizing quantum optimization algorithms via faster quantum gradient computation” In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019, pp. 1425–1444 DOI: 10.1137/1.9781611975482.87
[GB10] Xavier Glorot and Yoshua Bengio “Understanding the difficulty of training deep feedforward neural networks” In J. Mach. Learn. Res. 9, 2010, pp. 249–256 URL: https://proceedings.mlr.press/v9/glorot10a.html
[GH19] Michael Ganger and Wei Hu “Quantum Multiple Q-Learning” In International Journal of Intelligence Science 9.01, 2019, pp. 1–22 DOI: 10.4236/ijis.2019.91001
[Gil+19] András Gilyén, Yuan Su, Guang Hao Low and Nathan Wiebe “Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics” In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019, pp. 193–204 DOI: 10.1145/3313276.3316366
[Gra+19] Edward Grant, Leonard Wossnig, Mateusz Ostaszewski and Marcello Benedetti “An initialization strategy for addressing barren plateaus in parametrized quantum circuits” In Quantum 3, 2019, pp. 214 DOI: 10.22331/q-2019-12-09-214
[Haa+17] Tuomas Haarnoja, Haoran Tang, Pieter Abbeel and Sergey Levine “Reinforcement learning with deep energy-based policies” In Proceedings of Machine Learning Research 70, 2017, pp. 1352–1361 URL: https://proceedings.mlr.press/v70/haarnoja17a.html
[Haa+18] Tuomas Haarnoja et al. “Soft actor-critic algorithms and applications” In arXiv:1812.05905, 2018 DOI: 10.48550/arXiv.1812.05905
[Ham21] Yassine Hamoudi “Quantum Sub-Gaussian Mean Estimator” In 29th Annual European Symposium on Algorithms (ESA 2021) 204, 2021, pp. 50:1–50:17 DOI: 10.4230/LIPIcs.ESA.2021.50
[Has10] Hado Hasselt “Double Q-learning” In NeurIPS 23.2, 2010, pp. 2613–2621 URL: https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html
[HDW21] Arne Hamann, Vedran Dunjko and Sabine Wölk “Quantum-accessible reinforcement learning beyond strictly epochal environments” In Quantum Mach. Intell. 3.22, 2021, pp. 1–18 DOI: 10.1007/s42484-021-00049-7
[Hei+22] Dirk Heimann, Hans Hohenfeld, Felix Wiebe and Frank Kirchner “Quantum deep reinforcement learning for robot navigation tasks” In arXiv:2202.12180, 2022 DOI: 10.48550/arXiv.2202.12180
[HH19] Wei Hu and James Hu “ $Q$ Learning with Quantum Neural Networks” In Natural Science 11.01, 2019, pp. 31–39 DOI: 10.4236/ns.2019.111005
[HH19a] Wei Hu and James Hu “Distributional Reinforcement Learning with Quantum Neural Networks” In Intelligent Control and Automation 10.02, 2019, pp. 63–78 DOI: 10.4236/ica.2019.102004
[HH19b] Wei Hu and James Hu “Reinforcement Learning with Deep Quantum Neural Networks” In Journal of Quantum Information Science 9.01, 2019, pp. 1–14 DOI: 10.4236/jqis.2019.91001
[HH19c] Wei Hu and James Hu “Training a Quantum Neural Network to Solve the Contextual Multi-Armed Bandit Problem” In Natural Science 11, 2019, pp. 17–27 DOI: 10.4236/ns.2019.111003
[Hic+23] Manuel Lautaro Hickmann et al. “Potential analysis of a Quantum RL controller in the context of autonomous driving” In 31st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2023, 2023, pp. 263–268 DOI: 10.14428/esann/2023.ES2023-22
[HK21] Tobias Haug and MS Kim “Optimal training of variational quantum algorithms without barren plateaus” In arXiv:2104.14543, 2021 DOI: 10.48550/arXiv.2104.14543
[Hsi+22] Jen-Yueh Hsiao et al. “Unentangled quantum reinforcement learning agents in the OpenAI Gym” In arXiv:2203.14348, 2022 DOI: 10.48550/arXiv.2203.14348
[HSS06] Tomoki Hamagami, Takashi Shibuya and Shingo Shimada “Complex-valued reinforcement learning” In 2006 IEEE International Conference on Systems, Man and Cybernetics 5, 2006, pp. 4175–4179 DOI: 10.1109/ICSMC.2006.384789
[HSW89] Kurt Hornik, Maxwell Stinchcombe and Halbert White “Multilayer feedforward networks are universal approximators” In Neural Netw. 2.5, 1989, pp. 359–366 DOI: 10.1016/0893-6080(89)90020-8
[Hu+21] Yazhou Hu, Fengzhen Tang, Jun Chen and Wenxue Wang “Quantum-enhanced reinforcement learning for control: a preliminary study” In Control. Theory Technol. 19, 2021, pp. 455–464 DOI: 10.1007/s11768-021-00063-x
[HW22] Arne Hamann and Sabine Wölk “Performance analysis of a hybrid agent for quantum-accessible reinforcement learning” In New J. Phys. 24.3, 2022, pp. 033044 DOI: 10.1088/1367-2630/ac5b56
[IBM23] IBM Quantum “Qiskit Runtime Service, Sampler primitive (Version 0.9.1)”, https://quantum-computing.ibm.com/, 2023
[Jaš+19] Jan Jašek et al. “Experimental hybrid quantum-classical reinforcement learning by boson sampling: how to train a quantum cloner” In Optics Express 27.22, 2019, pp. 32454–32464 DOI: 10.1364/OE.27.032454
[Jer+21] Sofiene Jerbi et al. “Parametrized Quantum Policies for Reinforcement Learning” In Adv. Neural Inf. Process. Syst. 34, 2021, pp. 28362–28375 DOI: 10.5281/zenodo.5833370
[Jer+21a] Sofiene Jerbi et al. “Quantum Enhancements for Deep Reinforcement Learning in Large Spaces” In PRX Quantum 2.1, 2021, pp. 010328 DOI: 10.1103/PRXQuantum.2.010328
[Jer+23] Sofiene Jerbi, Arjan Cornelissen, Māris Ozols and Vedran Dunjko “Quantum Policy Gradient Algorithms” In 18th Conference on the Theory of Quantum Computation, Communication and Cryptography (TQC 2023), 2023, pp. 13:1–13:24 DOI: 10.4230/LIPIcs.TQC.2023.13
[JOA10] Thomas Jaksch, Ronald Ortner and Peter Auer “Near-optimal Regret Bounds for Reinforcement Learning” In J. Mach. Learn. Res. 11.51, 2010, pp. 1563–1600 URL: http://jmlr.org/papers/v11/jaksch10a.html
[Jor05] Stephen P. Jordan “Fast Quantum Algorithm for Numerical Gradient Estimation” In Phys. Rev. Lett. 95.5, 2005, pp. 050501 DOI: 10.1103/PhysRevLett.95.050501
[KCP23] Gyu Seon Kim, JaeHyun Chung and Soohyun Park “Realizing Stabilized Landing for Computation-Limited Reusable Rockets: A Quantum Reinforcement Learning Approach” In arXiv:2310.06541, 2023 DOI: 10.48550/arXiv.2310.06541
[Kha+19] Sami Khairy et al. “Reinforcement-Learning-Based Variational Quantum Circuits Optimization for Combinatorial Problems” In arXiv:1911.04574, 2019 DOI: 10.48550/arXiv.1911.04574
[Kha+20] Sami Khairy et al. “Learning to Optimize Variational Quantum Circuits to Solve Combinatorial Problems” In Proceedings of the AAAI Conference on Artificial Intelligence 34.03, 2020, pp. 2367–2375 DOI: 10.1609/aaai.v34i03.5616
[Kim+21] Tomoaki Kimura et al. “Variational Quantum Circuit-Based Reinforcement Learning for POMDP and Experimental Implementation” In Math. Probl. Eng. 2021, 2021, pp. 3511029 DOI: 10.1155/2021/3511029
[KLM21] Iordanis Kerenidis, Jonas Landman and Natansh Mathur “Classical and quantum algorithms for orthogonal neural networks” In arXiv:2106.07198, 2021 DOI: 10.48550/arXiv.2106.07198
[Köl+23] Michael Kölle et al. “Multi-Agent Quantum Reinforcement Learning using Evolutionary Optimization” In arXiv:2311.05546, 2023 DOI: 10.48550/arXiv.2311.05546
[Kos+21] Ilya Kostrikov, Rob Fergus, Jonathan Tompson and Ofir Nachum “Offline reinforcement learning with fisher divergence critic regularization” In International Conference on Machine Learning, 2021, pp. 5774–5783 PMLR URL: https://proceedings.mlr.press/v139/kostrikov21a.html
[KP20] Iordanis Kerenidis and Anupam Prakash “A quantum interior point method for LPs and SDPs” In ACM Transactions on Quantum Computing 1.1, 2020, pp. 1–32 DOI: 10.1145/3406306
[Kru+23] Georg Kruse, Theodora-Augustina Dragan, Robert Wille and Jeanette Miriam Lorenz “Variational Quantum Circuit Design for Quantum Reinforcement Learning on Continuous Environments” In arXiv:2312.13798, 2023 DOI: 10.48550/arXiv.2312.13798
[KSG21] Kunal Kashyap, Daksh Shah and Lokesh Gautam “From Classical to Quantum: A Review of Recent Progress in Reinforcement Learning” In 2021 2nd International Conference for Emerging Technology (INCET), 2021, pp. 1–5 DOI: 10.1109/INCET51464.2021.9456218
[KT03] Vijay Konda and John N. Tsitsiklis “On Actor-Critic Algorithms” In SIAM J. Control Optim. 42.4, 2003, pp. 1143–1166 DOI: 10.1137/S0363012901385691
[Kum+20] Aviral Kumar, Aurick Zhou, George Tucker and Sergey Levine “Conservative q-learning for offline reinforcement learning” In Advances in Neural Information Processing Systems 33, 2020, pp. 1179–1191 URL: https://proceedings.neurips.cc/paper/2020/hash/0d2b2061826a5df3221116a5085a6052-Abstract.html
[Kum+23] Manoj Kumar, Upasana Dohare, Sushil Kumar and Neeraj Kumar “Blockchain Based Optimized Energy Trading for E-Mobility Using Quantum Reinforcement Learning” In IEEE Trans. Veh. Technol. 72.4, 2023, pp. 5167–5180 DOI: 10.1109/TVT.2022.3225524
[Kun22] Leonhard Kunczik “Reinforcement Learning with Hybrid Quantum Approximation in the NISQ Context”, 2022 DOI: 10.1007/978-3-658-37616-1
[KVW18] Wouter Kool, Herke Van Hoof and Max Welling “Attention, learn to solve routing problems!” In arXiv:1803.08475, 2018 DOI: 10.48550/arXiv.1803.08475
[Kwa+21] Yunseok Kwak et al. “Introduction to Quantum Reinforcement Learning: Theory and PennyLane-based Implementation” In 2021 International Conference on Information and Communication Technology Convergence (ICTC), 2021, pp. 416–420 DOI: 10.1109/ICTC52510.2021.9620885
[LAD21] Yuanjian Li, A Hamid Aghvami and Daoyi Dong “Intelligent Trajectory Planning in UAV-Mounted Wireless Networks: A Quantum-Inspired Reinforcement Learning Perspective” In IEEE Wireless Commun. Lett. 10.9, 2021, pp. 1994–1998 DOI: 10.1109/LWC.2021.3089876
[Lam17] Lucas Lamata “Basic protocols in quantum reinforcement learning with superconducting circuits” In Sci. Rep. 7.1, 2017, pp. 1–10 DOI: 10.1038/s41598-017-01711-6
[Lam21] Lucas Lamata “Quantum Reinforcement Learning with Quantum Photonics” In Photonics 8.2, 2021, pp. 33 DOI: 10.3390/photonics8020033
[Lam23] Lucas Lamata “Quantum Machine Learning Implementations: Proposals and Experiments” In Adv. Quantum Technol. 6.7, 2023, pp. 2300059 DOI: 10.1002/qute.202300059
[Lan21] Qingfeng Lan “Variational Quantum Soft Actor-Critic” In arXiv:2112.11921, 2021 DOI: 10.48550/arXiv.2112.11921
[LAT21] Yunchao Liu, Srinivasan Arunachalam and Kristan Temme “A rigorous and robust quantum speed-up in supervised machine learning” In Nat. Phys. 17, 2021, pp. 1013–1017 DOI: 10.1038/s41567-021-01287-z
[Lev+17] Anna Levit et al. “Free energy-based reinforcement learning using a quantum processor” In arXiv:1706.00074, 2017 DOI: 10.48550/arXiv.1706.00074
[Lev+20] Sergey Levine, Aviral Kumar, George Tucker and Justin Fu “Offline reinforcement learning: Tutorial, review, and perspectives on open problems” In arXiv:2005.01643, 2020 DOI: 10.48550/arXiv.2005.01643
[LHT22] Josep Lumbreras, Erkka Haapasalo and Marco Tomamichel “Multi-armed quantum bandits: Exploration versus exploitation when learning properties of quantum states” In Quantum 6, 2022, pp. 749 DOI: 10.22331/q-2022-06-29-749
[Li+20] Ji-An Li et al. “Quantum reinforcement learning during human decision-making” In Nat. Hum. Behav. 4.3, 2020, pp. 294–307 DOI: 10.1038/s41562-019-0804-2
[Li+20a] Gen Li et al. “Breaking the sample size barrier in model-based reinforcement learning with a generative model” In Adv. Neural Inf. Process. Syst. 33, 2020, pp. 12861–12872 URL: https://proceedings.neurips.cc/paper/2020/hash/96ea64f3a1aa2fd00c72faacf0cb8ac9-Abstract.html
[Lin92] Long-Ji Lin “Self-improving reactive agents based on reinforcement learning, planning and teaching” In Mach. Learn. 8.3, 1992, pp. 293–321 DOI: 10.1007/BF00992699
[Liu+22] Wenjie Liu et al. “A quantum system control method based on enhanced reinforcement learning” In Soft Comput. 26.14, 2022, pp. 6567–6575 DOI: 10.1007/s00500-022-07179-5
[Liu+23] Dan Liu et al. “Multi-agent quantum-inspired deep reinforcement learning for real-time distributed generation control of 100% renewable energy systems” In Eng. Appl. Artif. Intell. 119, 2023, pp. 105787 DOI: 10.1016/j.engappai.2022.105787
[LJ09] Mantas Lukoševičius and Herbert Jaeger “Reservoir computing approaches to recurrent neural network training” In Comput. Sci. Rev. 3.3, 2009, pp. 127–149 DOI: 10.1016/j.cosrev.2009.03.005
[LJW22] Yi-Pei Liu, Qing-Shan Jia and Xu Wang “Quantum reinforcement learning method and application based on value function” In IFAC-PapersOnLine 55.11, 2022, pp. 132–137 DOI: 10.1016/j.ifacol.2022.08.061
[LM23] Victor Lopez-Pastor and Florian Marquardt “Self-learning machines based on Hamiltonian echo backpropagation” In Phys. Rev. X 13.3, 2023, pp. 031020 DOI: 10.1103/PhysRevX.13.031020
[Lok+22] S Lokes et al. “Implementation of Quantum Deep Reinforcement Learning Using Variational Quantum Circuits” In 2022 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT), 2022, pp. 1–4 DOI: 10.1109/TQCEBT54229.2022.10041479
[Low+17] Ryan Lowe et al. “Multi-agent actor-critic for mixed cooperative-competitive environments” In Adv. Neural Inf. Process. Syst. 30, 2017, pp. 6382–6393 URL: https://dl.acm.org/doi/10.5555/3295222.3295385
[LP03] Michail G Lagoudakis and Ronald Parr “Least-squares policy iteration” In J. Mach. Learn. Res. 4, 2003, pp. 1107–1149 URL: https://www.jmlr.org/papers/v4/lagoudakis03a.html
[LS20] Owen Lockwood and Mei Si “Reinforcement Learning with Quantum Variational Circuit” In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 16.1, 2020, pp. 245–251 URL: https://ojs.aaai.org/index.php/AIIDE/article/view/7437
[LS21] Owen Lockwood and Mei Si “Playing Atari with Hybrid Quantum-Classical Reinforcement Learning” In NeurIPS 2020 Workshop on Pre-registration in Machine Learning 148, 2021, pp. 285–301 PMLR URL: https://proceedings.mlr.press/v148/lockwood21a.html
[Luo+20] Xiu-Zhe Luo, Jin-Guo Liu, Pan Zhang and Lei Wang “Yao.jl: Extensible, efficient framework for quantum algorithm design” In Quantum 4, 2020, pp. 341 DOI: 10.22331/q-2020-10-11-341
[LXJ23] Yaofu Liu, Chang Xu and Siyuan Jin “Reinforcement Learning for Continuous Control: A Quantum Normalized Advantage Function Approach” In 2023 IEEE International Conference on Quantum Software (QSW), 2023, pp. 83–87 DOI: 10.1109/QSW59989.2023.00020
[LZ22] Tongyang Li and Ruizhe Zhang “Quantum Speedups of Optimizing Approximately Convex Functions with Applications to Logarithmic Regret Stochastic Convex Bandits” In Advances in Neural Information Processing Systems 35, 2022, pp. 3152–3164 URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/14f75513f0f1ca01de1e826b52e6b840-Abstract-Conference.html
[Mei+23] Kai Meinerz, Simon Trebst, Mark Rudner and Evert Nieuwenburg “The Quantum Cartpole: A benchmark environment for non-linear reinforcement learning” In arXiv:2311.00756, 2023 DOI: 10.48550/arXiv.2311.00756
[Mel+17] Alexey A. Melnikov, Adi Makmal, Vedran Dunjko and Hans J. Briegel “Projective simulation with generalization” In Sci. Rep. 7.1, 2017, pp. 1–14 DOI: 10.1038/s41598-017-14740-y
[Mey+23] Nico Meyer et al. “Quantum Natural Policy Gradients: Towards Sample-Efficient Reinforcement Learning” In IEEE International Conference on Quantum Computing and Engineering (QCE) 2, 2023, pp. 36–41 DOI: 10.1109/QCE57702.2023.10181
[Mey+23a] Nico Meyer et al. “Quantum Policy Gradient Algorithm with Optimized Action Decoding” In International Conference on Machine Learning (ICML) 202, 2023, pp. 24592–24613 PMLR URL: https://proceedings.mlr.press/v202/meyer23a.html
[Mey21] Nico Meyer “Variational Quantum Circuits for Policy Approximation”, 2021
[MK21] Maximilian Moll and Leonhard Kunczik “Comparing quantum hybrid reinforcement learning to classical methods” In Hum. Intell. Syst. Integr. 3.1, 2021, pp. 15–23 DOI: 10.1007/s42454-021-00025-3
[ML21] José D Martín-Guerrero and Lucas Lamata “Reinforcement Learning and Physics” In Appl. Sci. 11.18, 2021, pp. 8589 DOI: 10.3390/app11188589
[ML22] José D. Martín-Guerrero and Lucas Lamata “Quantum Machine Learning: A tutorial” In Neurocomputing 470, 2022, pp. 457–461 DOI: 10.1016/j.neucom.2021.02.102
[Mni+15] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning” In Nature 518.7540, 2015, pp. 529–533 DOI: 10.1038/nature14236
[Mni+16] Volodymyr Mnih et al. “Asynchronous methods for deep reinforcement learning” In International Conference on Machine Learning (ICML) 48 PMLR, 2016, pp. 1928–1937 URL: https://proceedings.mlr.press/v48/mniha16.html
[MNM17] Masaki Mochida, Hidehiro Nakano and Arata Miyauchi “A complex-valued reinforcement learning method using complex-valued neural networks” In IEICE Technical Report; IEICE Tech. Rep. 117.112, 2017, pp. 1–5 URL: https://ken.ieice.org/ken/paper/20170629ebuV/eng/
[Mon15] Ashley Montanaro “Quantum speedup of Monte Carlo methods” In Proc. Math. Phys. Eng. Sci. 471.2181, 2015, pp. 20150301 DOI: 10.1098/rspa.2015.0301
[Mül+21] Tobias Müller, Christoph Roch, Kyrill Schmid and Philipp Altmann “Towards Multi-Agent Reinforcement Learning using Quantum Boltzmann Machines” In arXiv:2109.10900, 2021 DOI: 10.48550/arXiv.2109.10900
[MVB22] Thomas Mullor, David Vigouroux and Louis Bethune “Efficient circuit implementation for coined quantum walks on binary trees and application to reinforcement learning” In IEEE/ACM Symposium on Edge Computing (SEC), 2022, pp. 436–443 DOI: 10.1109/SEC54971.2022.00066
[Nag+21] Dániel Nagy et al. “Photonic quantum policy learning in OpenAI Gym” In IEEE International Conference on Quantum Computing and Engineering (QCE), 2021, pp. 123–129 DOI: 10.1109/QCE52317.2021.00028
[Neu+17] Florian Neukart et al. “Traffic flow optimization using a quantum annealer” In Front. ICT 4, 2017, pp. 29 DOI: 10.3389/fict.2017.00029
[Neu+20] Niels MP Neumann, Paolo BUL Heer, Irina Chiscop and Frank Phillipson “Multi-agent Reinforcement Learning Using Simulated Quantum Annealing” In International Conference on Computational Science, 2020, pp. 562–575 DOI: 10.1007/978-3-030-50433-5_43
[NGC15] Sinan Nuuman, David Grace and Tim Clarke “A quantum inspired reinforcement learning technique for beyond next generation wireless networks” In 2015 IEEE Wireless Communications and Networking Conference Workshops (WCNCW), 2015, pp. 271–275 DOI: 10.1109/WCNCW.2015.7122566
[NHP23] Niels MP Neumann, Paolo BUL Heer and Frank Phillipson “Quantum reinforcement learning: Comparing quantum annealing and gate-based quantum computing with classical deep reinforcement learning” In Quantum Inf. Process. 22.2, 2023, pp. 125 DOI: 10.1007/s11128-023-03867-9
[Nir+21] Dipesh Niraula et al. “Quantum deep reinforcement learning for clinical decision support in oncology: application to adaptive radiotherapy” In Sci. Rep. 11.1, 2021, pp. 1–13 DOI: 10.1038/s41598-021-02910-y
[NL16] M.A. Nielsen and I.L. Chuang “Quantum Computation and Quantum Information (10th Anniversary edition)” Cambridge University Press, 2016 DOI: 10.1017/CBO9780511976667
[NLH20] Rui Nian, Jinfeng Liu and Biao Huang “A review On reinforcement learning: Introduction and applications in industrial process control” In Comput. Chem. Eng. 139, 2020, pp. 106886 DOI: 10.1016/j.compchemeng.2020.106886
[NS+23] Bhaskara Narottama and Soo Young Shin “Layerwise Quantum Deep Reinforcement Learning for Joint Optimization of UAV Trajectory and Resource Allocation” In IEEE Internet Things J., 2023 DOI: 10.1109/JIOT.2023.3285968
[NW05] Sanjeev Naguleswaran and Langford B. White “Quantum search in stochastic planning” In Noise and Information in Nanoelectronics, Sensors, and Standards III 5846, 2005, pp. 34–45 DOI: 10.1117/12.609962
[NY23] Egor E Nuzhin and Dmitry Yudin “Quantum-enhanced policy iteration on the example of a mountain car” In arXiv:2308.08348, 2023 DOI: 10.48550/arXiv.2308.08348
[Oli+20] Julio Olivares-Sánchez, Jorge Casanova, Enrique Solano and Lucas Lamata “Measurement-Based Adaptation Protocol with Quantum Reinforcement Learning in a Rigetti Quantum Computer” In Quantum Reports 2.2, 2020, pp. 293–304 DOI: 10.3390/quantum2020019
[Pap+14] Giuseppe Davide Paparo et al. “Quantum Speedup for Active Learning Agents” In Phys. Rev. X 4.3, 2014, pp. 031002 DOI: 10.1103/PhysRevX.4.031002
[Par+23] Chanyoung Park et al. “Quantum Multi-Agent Actor-Critic Networks for Cooperative Mobile Access in Multi-UAV Systems” In IEEE Internet Things J. 10.22, 2023, pp. 20033–20048 DOI: 10.1109/JIOT.2023.3282908
[Par+23a] Soohyun Park et al. “Quantum Multi-Agent Reinforcement Learning for Autonomous Mobility Cooperation” In IEEE Commun. Mag., 2023 DOI: 10.1109/MCOM.020.2300199
[Per+06] David Perez-Garcia, Frank Verstraete, Michael M Wolf and J Ignacio Cirac “Matrix product state representations” In arXiv:quant-ph/0608197, 2006 DOI: 10.48550/arXiv.quant-ph/0608197
[Pér+20] Adrián Pérez-Salinas, Alba Cervera-Lierta, Elies Gil-Fuster and José I Latorre “Data re-uploading for a universal quantum classifier” In Quantum 4, 2020, pp. 226 DOI: 10.22331/q-2020-02-06-226
[Per+22] Maniraman Periyasamy et al. “Incremental Data-Uploading for Full-Quantum Classification” In IEEE International Conference on Quantum Computing and Engineering (QCE), 2022, pp. 31–37 DOI: 10.1109/QCE53715.2022.00021
[Per+23] Maniraman Periyasamy et al. “Batch Quantum Reinforcement Learning” In arXiv:2305.00905, 2023 DOI: 10.48550/arXiv.2305.00905
[Pes+21] Arthur Pesah et al. “Absence of barren plateaus in quantum convolutional neural networks” In Phys. Rev. X 11.4, 2021, pp. 041011 DOI: 10.1103/PhysRevX.11.041011
[PK23] Soohyun Park and Joongheon Kim “Quantum Reinforcement Learning for Large-Scale Multi-Agent Decision-Making in Autonomous Aerial Networks” In 2023 VTS Asia Pacific Wireless Communications Symposium (APWCS), 2023, pp. 1–4 DOI: 10.1109/APWCS60142.2023.10233966
[PMV02] Vladimir Privman, Dima Mozyrsky and Israel Vagner “Quantum computing with spin qubits in semiconductor structures” In Comput. Phys. Commun. 146, 2002, pp. 331–338 DOI: 10.1016/S0010-4655(02)00424-1
[PPR20] Daniel K. Park, Jonghun Park and June-Koo Kevin Rhee “Quantum-classical reinforcement learning for decoding noisy classical parity information” In Quantum Mach. Intell. 2.1, 2020, pp. 1–11 DOI: 10.1007/s42484-020-00019-5
[PRD96] Elena Pashenkova, Irina Rish and Rina Dechter “Value iteration and policy iteration algorithms for Markov decision problem” Citeseer, 1996 URL: https://www.researchgate.net/publication/2605845_Value_iteration_and_policy_iteration_algorithms_for_Markov_decision_problem
[Rai+23] Serge Rainjonneau et al. “Quantum algorithms applied to satellite mission planning for Earth observation” In IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16, 2023, pp. 7062–7075 DOI: 10.1109/JSTARS.2023.3287154
[Raj+21] K Rajagopal et al. “Quantum amplitude amplification for reinforcement learning” In Handbook of Reinforcement Learning and Control 325, 2021, pp. 819–833 DOI: 10.1007/978-3-030-60990-0_26
[Ram17] A Ramezanpour “Optimization by a quantum reinforcement algorithm” In Phys. Rev. A 96.5, 2017, pp. 052307 DOI: 10.1103/PhysRevA.96.052307
[Ree23] Volker Reers “Towards Performance Benchmarking for Quantum Reinforcement Learning” In INFORMATIK 2023 - Designing Futures: Zukünfte gestalten. Gesellschaft für Informatik eV, 2023, pp. 1135–1145 DOI: 10.18420/inf2023_126
[Ren+22] Yuzheng Ren et al. “NFT-based intelligence networking for connected and autonomous vehicles: A quantum reinforcement learning approach” In IEEE Network 36.6, 2022, pp. 116–124 DOI: 10.1109/MNET.107.2100469
[RKM22] Farhad Rezazadeh, Sarang Kahvazadeh and Mohammadreza Mosahebfard “Towards Quantum-Enabled 6G Slicing” In arXiv:2212.11755, 2022 DOI: 10.48550/arXiv.2212.11755
[RN94] Gavin A Rummery and Mahesan Niranjan “On-line Q-learning using connectionist systems” Citeseer, 1994 URL: https://www.researchgate.net/publication/2500611_On-Line_Q-Learning_Using_Connectionist_Systems
[Ron19] Pooya Ronagh “The Problem of Dynamic Programming on a Quantum Computer” In arXiv:1906.02229, 2019 DOI: 10.48550/arXiv.1906.02229
[Sag+21] Valeria Saggio et al. “Experimental quantum speed-up in reinforcement learning agents” In Nature 591.7849, 2021, pp. 229–233 DOI: 10.1038/s41586-021-03242-7
[Sag+21a] Valeria Saggio et al. “Quantum speed-ups in reinforcement learning” In Quantum Nanophotonic Materials, Devices, and Systems 2021 11806, 2021, pp. 40–49 DOI: 10.1117/12.2593720
[San+22] Fabio Sanches, Sean Weinberg, Takanori Ide and Kazumitsu Kamiya “Short quantum circuits in reinforcement learning policies for the vehicle routing problem” In Phys. Rev. A 105.6, 2022, pp. 062403 DOI: 10.1103/PhysRevA.105.062403
[San+23] Antonio Sannia et al. “A hybrid classical-quantum approach to speed-up Q-learning” In Sci. Rep. 13.1, 2023, pp. 3913 DOI: 10.1038/s41598-023-30990-5
[SB18] R.S. Sutton and A.G. Barto “Reinforcement Learning: An Introduction” The MIT Press, 2018 URL: http://incompleteideas.net/book/the-book-2nd.html
[Sch+22] Michael Schenk et al. “Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines” In arXiv:2209.11044, 2022 DOI: 10.48550/arXiv.2209.11044
[SH20] Erik Sorensen and Wei Hu “Practical Meta-Reinforcement Learning of Evolutionary Strategy with Quantum Neural Networks for Stock Trading” In Journal of Quantum Information Science 10.3, 2020, pp. 43–71 DOI: 10.4236/jqis.2020.103005
[SH23] Maida Shahid and Muhammad Awais Hassan “Introducing Quantum Variational Circuit for Efficient Management of Common Pool Resources” In IEEE Access 11, 2023, pp. 110862–110877 DOI: 10.1109/ACCESS.2023.3322144
[She+20] Kishore S Shenoy, Dev Y Sheth, Bikash K Behera and Prasanta K Panigrahi “Demonstration of a measurement-based adaptation protocol with quantum reinforcement learning on the IBM Q experience platform” In Quantum Inf. Process. 19, 2020, pp. 1–13 DOI: 10.1007/s11128-020-02657-x
[Shi+22] Hiroaki Shinkawa et al. “Bandit approach to conflict-free multi-agent Q-learning in view of photonic implementation” In arXiv:2212.09926, 2022 DOI: 10.48550/arXiv.2212.09926
[Sho97] Peter W. Shor “Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer” In SIAM J. Comput. 26.5, 1997, pp. 1484–1509 DOI: 10.1137/s0097539795293172
[Sil+18] David Silver et al. “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play” In Science 362.6419, 2018, pp. 1140–1144 DOI: 10.1126/science.aar6404
[SJA19] Sukin Sim, Peter D Johnson and Alán Aspuru-Guzik “Expressibility and entangling capability of parameterized quantum circuits for hybrid quantum-classical algorithms” In Adv. Quantum Technol. 2.12, 2019, pp. 1900070 DOI: 10.1002/qute.201900070
[SJD22] Andrea Skolik, Sofiene Jerbi and Vedran Dunjko “Quantum agents in the Gym: a variational quantum algorithm for deep Q-learning” In Quantum 6, 2022, pp. 720 DOI: 10.22331/q-2022-05-24-720
[Sko+23] Andrea Skolik et al. “Robustness of quantum reinforcement learning under hardware errors” In EPJ Quantum Technol. 10.1, 2023, pp. 1–43 DOI: 10.1140/epjqt/s40507-023-00166-1
[SMK23] Akash Sinha, Antonio Macaluso and Matthias Klusch “Nav-Q: Quantum Deep Reinforcement Learning for Collision-Free Navigation of Self-Driving Cars” In arXiv:2311.12875, 2023 DOI: 10.48550/arXiv.2311.12875
[SMT23] Yize Sun, Yunpu Ma and Volker Tresp “Differentiable Quantum Architecture Search for Quantum Reinforcement Learning” In IEEE International Conference on Quantum Computing and Engineering (QCE) 2, 2023, pp. 15–19 DOI: 10.1109/QCE57702.2023.10177
[SP18] Maria Schuld and Francesco Petruccione “Supervised Learning with Quantum Computers” Springer, 2018 URL: https://link.springer.com/book/10.1007/978-3-319-96424-9
[Sri+18] Theeraphot Sriarunothai et al. “Speeding-up the decision making of a learning agent using an ion trap quantum processor” In Quantum Sci. Technol. 4.1, 2018, pp. 015014 DOI: 10.1088/2058-9565/aaef5e
[SSB23] André Sequeira, Luis Paulo Santos and Luis Soares Barbosa “Policy gradients using variational quantum circuits” In Quantum Mach. Intell. 5.1, 2023, pp. 18 DOI: 10.1007/s42484-023-00101-8
[SSM21] M. Schuld, R. Sweke and J.J. Meyer “Effect of data encoding on the expressive power of variational quantum-machine-learning models” In Phys. Rev. A 103.3, 2021, pp. 032430 DOI: 10.1103/physreva.103.032430
[Sto+20] James Stokes, Josh Izaac, Nathan Killoran and Giuseppe Carleo “Quantum Natural Gradient” In Quantum 4, 2020, pp. 269 DOI: 10.22331/q-2020-05-25-269
[Sut+99] Richard S. Sutton, David McAllester, Satinder Singh and Yishay Mansour “Policy gradient methods for reinforcement learning with function approximation” In NeurIPS 12, 1999 URL: https://papers.nips.cc/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html
[SWM10] Mark Saffman, Thad G Walker and Klaus Mølmer “Quantum information with Rydberg atoms” In Rev. Mod. Phys. 82.3, 2010, pp. 2313 DOI: 10.1103/RevModPhys.82.2313
[Tei21] Miguel Alexandre Brandão Teixeira “Quantum Reinforcement Learning Applied to Games”, 2021 URL: https://repositorio-aberto.up.pt/bitstream/10216/135628/2/487581.pdf
[Tha+23] Supanut Thanasilp et al. “Subtleties in the trainability of quantum machine learning models” In Quantum Mach. Intell. 5.1, 2023, pp. 21 DOI: 10.1007/s42484-023-00103-6
[TRC21] Miguel Teixeira, Ana Paula Rocha and Antonio JM Castro “Quantum Reinforcement Learning Applied to Board Games” In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2021, pp. 343–350 DOI: 10.1145/3486622.3493944
[Tru+23] Nguyen Truong Thu Ngo, Tien-Fu Lu, James Quach and Peter Bruza “Investigating Quantum Reinforcement Learning structure to the CartPole control task” In Proceedings of the 9th International Conference of Asian Society for Precision Engineering and Nanotechnology (ASPEN2022), 2023, pp. 227–230 URL: https://eprints.qut.edu.au/239327/
[VGS16] Hado Van Hasselt, Arthur Guez and David Silver “Deep reinforcement learning with double Q-learning” In Proceedings of the AAAI Conference on Artificial Intelligence 30.1, 2016 DOI: 10.1609/aaai.v30i1.10295
[Wan+21] Daochen Wang et al. “Quantum algorithms for reinforcement learning with a generative model” In International Conference on Machine Learning (ICML) 139 PMLR, 2021, pp. 10916–10926 URL: https://proceedings.mlr.press/v139/wang21w.html
[Wan+21a] Daochen Wang, Xuchen You, Tongyang Li and Andrew M Childs “Quantum exploration algorithms for multi-armed bandits” In Proceedings of the AAAI Conference on Artificial Intelligence 35.11, 2021, pp. 10102–10110 DOI: 10.1609/aaai.v35i11.17212
[Wan+23] Zongqi Wan et al. “Quantum multi-armed bandits and stochastic linear bandits enjoy logarithmic regrets” In Proceedings of the AAAI Conference on Artificial Intelligence 37.8, 2023, pp. 10087–10094 DOI: 10.1609/aaai.v37i8.26202
[WAU20] Zhikang T Wang, Yuto Ashida and Masahito Ueda “Deep reinforcement learning control of quantum cartpoles” In Phys. Rev. Lett. 125.10, 2020, pp. 100401 DOI: 10.1103/PhysRevLett.125.100401
[WD92] Christopher J.C.H. Watkins and Peter Dayan “Q-learning” In Mach. Learn. 8.3, 1992, pp. 279–292 DOI: 10.1007/BF00992698
[Wei+21] Qing Wei, Hailan Ma, Chunlin Chen and Daoyi Dong “Deep Reinforcement Learning With Quantum-Inspired Experience Replay” In IEEE Trans. Cybern. 52.9, 2021, pp. 9326–9338 DOI: 10.1109/TCYB.2021.3053414
[Wie+22] Simon Wiedemann, Daniel Hein, Steffen Udluft and Christian Mendl “Quantum Policy Iteration via Amplitude Estimation and Grover Search–Towards Quantum Advantage for Reinforcement Learning” In arXiv:2206.04741, 2022 DOI: 10.48550/arXiv.2206.04741
[Wie+22a] David Wierichs, Josh Izaac, Cody Wang and Cedric Yen-Yu Lin “General parameter-shift rules for quantum gradients” In Quantum 6, 2022, pp. 677 DOI: 10.22331/q-2022-03-30-677
[Wie+23] Marco Wiedmann et al. “An Empirical Comparison of Optimizers for Quantum Machine Learning with SPSA-based Gradients” In IEEE International Conference on Quantum Computing and Engineering (QCE) 1, 2023, pp. 450–456 DOI: 10.1109/QCE57702.2023.00058
[Wie21] Simon Wiedemann “Modelling and Solving Reinforcement Learning Problems on Quantum Computers”, 2021
[Wil92] Ronald J. Williams “Simple statistical gradient-following algorithms for connectionist reinforcement learning” In Mach. Learn. 8.3, 1992, pp. 229–256 DOI: 10.1007/BF00992696
[Wu+23] Shaojun Wu, Shan Jin, Dingding Wen and Xiaoting Wang “Quantum reinforcement learning in continuous action space” In arXiv:2012.10711, 2023 DOI: 10.48550/arXiv.2012.10711
[Yan+22] Rudai Yan, Yu Wang, Yan Xu and Jiahong Dai “A multiagent quantum deep reinforcement learning method for distributed frequency control of islanded microgrids” In IEEE Trans. Control Netw. Syst. 9.4, 2022, pp. 1622–1632 DOI: 10.1109/TCNS.2022.3140702
[Yan23] Junzheng Yang “Apply Deep Reinforcement Learning with Quantum Computing on the Pricing of American Options” In Internet Finance and Digital Economy, 2023, pp. 675–694 DOI: 10.1142/9789811267505_0050
[Yin+21] Linfei Yin et al. “Quantum deep reinforcement learning for rotor side converter control of double-fed induction generator-based wind turbines” In Engineering Applications of Artificial Intelligence 106, 2021, pp. 104451 DOI: 10.1016/j.engappai.2021.104451
[YN06] J.Q. You and Franco Nori “Superconducting Circuits and Quantum Information” In Phys. Today 58.11, 2006, pp. 42 DOI: 10.1063/1.2155757
[YPK23] Won Joon Yun, Jihong Park and Joongheon Kim “Quantum multi-agent meta reinforcement learning” In Proceedings of the AAAI Conference on Artificial Intelligence 37.9, 2023, pp. 11087–11095 DOI: 10.1609/aaai.v37i9.26313
[Yun+22] Won Joon Yun et al. “Quantum multi-agent reinforcement learning via variational quantum circuit design” In 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), 2022, pp. 1332–1335 DOI: 10.1109/ICDCS54860.2022.00151
[Yun+23] Won Joon Yun et al. “Quantum Multi-Agent Actor-Critic Neural Networks for Internet-Connected Multi-Robot Coordination in Smart Factory Management” In IEEE Internet Things J. 10.11, 2023, pp. 9942–9952 DOI: 10.1109/JIOT.2023.3234911
[Zha+11] Tingting Zhao, Hirotaka Hachiya, Gang Niu and Masashi Sugiyama “Analysis and Improvement of Policy Gradient Estimation” In Adv. Neural Inf. Process. Syst. 24, 2011, pp. 118–129 DOI: 10.1016/j.neunet.2011.09.005
[Zha+19] Xiao-Ming Zhang et al. “When does reinforcement learning stand out in quantum control? A comparative study on state preparation” In NPJ Quantum Inf. 5.1, 2019, pp. 85 DOI: 10.1038/s41534-019-0201-8
[Zha+22] Shi-Xin Zhang, Chang-Yu Hsieh, Shengyu Zhang and Hong Yao “Differentiable quantum architecture search” In Quantum Sci. Technol. 7.4, 2022, pp. 045023 DOI: 10.1088/2058-9565/ac87cd
[Zho+23] Han Zhong et al. “Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret” In arXiv:2302.10796, 2023 DOI: 10.48550/arXiv.2302.10796
[Zie+08] Brian D. Ziebart, Andrew L. Maas, J.Andrew Bagnell and Anind K. Dey “Maximum entropy inverse reinforcement learning” In AAAI 8, 2008, pp. 1433–1438 URL: https://www.aaai.org/Papers/AAAI/2008/AAAI08-227.pdf
[ZY23] Jun Zhao and Wenhan Yu “Quantum Multi-Agent Reinforcement Learning as an Emerging AI Technology: A Survey and Future Directions” In TechRxiv, 2023 DOI: 10.36227/techrxiv.24563293.v1