
License: arXiv.org perpetual non-exclusive license
arXiv:2305.00905v2 [quant-ph] 18 Mar 2024

BCQQ: Batch-Constraint Quantum Q-Learning with Cyclic Data Re-uploading

Maniraman Periyasamy, Marc Hölle, Marco Wiedmann, Daniel D. Scherer, Axel Plinge, Christopher Mutschler Fraunhofer IIS, Fraunhofer Institute for Integrated Circuits IIS, Nuremberg, Germany
Abstract

Deep reinforcement learning (DRL) often requires a large amount of data and many environment interactions, making the training process time-consuming. This challenge is further exacerbated in the case of batch RL, where the agent is trained solely on a pre-collected dataset without environment interactions. Recent advancements in quantum computing suggest that quantum models might require less data for training compared to classical methods. In this paper, we investigate this potential advantage by proposing a batch RL algorithm that utilizes variational quantum circuits (VQCs) as function approximators within the discrete batch-constraint deep Q-learning (BCQ) algorithm. Additionally, we introduce a novel data re-uploading scheme by cyclically shifting the order of input variables in the data encoding layers. We evaluate the efficiency of our algorithm on the OpenAI CartPole environment and compare its performance to the classical neural network-based discrete BCQ.

Index Terms:
quantum reinforcement learning, batch reinforcement learning, variational quantum computing, data uploading, data re-uploading, batch quantum reinforcement learning, offline quantum reinforcement learning.
©This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

I Introduction

The challenge of applying reinforcement learning (RL) in real-world problems lies in its training process. In contrast to fully data-driven machine learning (ML) procedures such as supervised learning, RL learns via environment interactions. An RL agent follows a policy and chooses actions that change the state of the environment, for which it receives a reward (possibly after each interaction). The agent’s objective is to learn a policy that maximizes the long-term reward. Using this framework, agents based on deep neural networks (DNNs) have been remarkably successful in a variety of complex tasks, including super-human performance in computationally-hard board games [1] and discovering fast matrix multiplication algorithms [2].

Unfortunately, this interactive approach is not feasible in many safety-critical scenarios that could benefit from RL, e.g., robotics or healthcare. While in general it is possible to train RL agents in a simulator, it is often non-trivial to deploy them in the real world due to the domain gap between simulation and reality. In these cases, it would be beneficial to utilize real-world data, gathered by an expert operator, and train in a purely data-driven, offline fashion. However, current offline algorithms require large datasets to match the performance of algorithms trained with environment interactions [3]. This is an impediment in domains where only limited data is available. One of the main reasons for this performance loss is that the distribution of states and actions during testing can drastically differ from the training data. Although recent RL research has addressed this problem intensively, these challenges remain mostly unresolved.

With the advent of the first practical quantum computers, so-called noisy intermediate scale quantum (NISQ) devices, it is natural to investigate whether this new computing paradigm could be leveraged to improve RL. Factors like low gate fidelity and short coherence times of these NISQ devices cause applied research to focus on hybrid quantum-classical schemes such as variational quantum circuits (VQCs) [4, 5, 6]. These can be considered as the quantum analog to classical DNNs and are therefore also called quantum neural networks. Theoretical work indicates that VQCs are more data-efficient than classical ML methods [7]. As data can become a bottleneck when learning offline, we are interested in investigating whether this theoretical advantage can be turned into a practical performance gain. This would translate to a quantum reinforcement learning (QRL) algorithm [8] that can learn a policy from a small dataset and outperform a classical policy trained on the same data. Currently, the limited number of qubits together with the corresponding hardware topology and the complexity of numerical simulations make it intractable to benchmark QRL algorithms on state-of-the-art environments on a large scale. However, proof-of-concept experiments can be executed in low-complexity environments such as OpenAI’s CartPole [9].

Our contribution in this paper is two-fold. First, we show how to apply function approximation within the discrete batch-constraint deep Q-learning algorithm [10] with VQCs, where we find a performance advantage over the classical counterpart in CartPole. Near-optimal performance can be achieved in a low-data regime, where a classical agent with a similar number of parameters, which is able to solve CartPole in an online setting, fails offline. Second, we present a cyclic data re-uploading scheme that proved to be advantageous in this batch RL context and analyze the effective dimension of the resulting VQC.

II Theoretical Background

II-A General Framework of Reinforcement Learning

On an abstract level, RL can be modelled in the framework of Markov-Decision-Problems (MDPs) [11], in which an agent interacts with its environment. An MDP is defined by a set of states $\mathcal{S}$, actions $\mathcal{A}$, reward function $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathds{R}$ and discount factor $\gamma\in[0,1]$. The goal is to find an optimal policy that assigns each action $a\in\mathcal{A}$ a probability $\pi(a|s)$ with which the agent should take that particular action, given that the environment is in state $s\in\mathcal{S}$. In this case, optimality means that the policy should maximize the expected, discounted reward for any given initial state $s_{0}\in\mathcal{S}$, i.e.

$$\pi^{*}\in\underset{\pi}{\mathrm{argmax}}\sum_{t=0}^{\infty}\underset{\begin{subarray}{c}a_{t}\sim\pi\\ s_{t+1}\sim T\end{subarray}}{\mathds{E}}\gamma^{t}R(s_{t},a_{t}). \tag{1}$$

One usually defines a state-action value function or Q-function

$$Q_{\pi}(s,a)=\underset{s_{0}\sim T}{\mathds{E}}\left(R(s,a)+\gamma\sum_{t=0}^{\infty}\underset{\begin{subarray}{c}a_{t}\sim\pi\\ s_{t+1}\sim T\end{subarray}}{\mathds{E}}\gamma^{t}R(s_{t},a_{t})\right) \tag{2}$$

as the expected, discounted cumulative reward from choosing action $a$ in state $s$ and following the policy $\pi$ from there on. It can be shown that the Q-function of an optimal policy must satisfy a recursive relation known as the Bellman optimality equation [12]

$$Q^{*}(s,a)=\underset{s^{\prime}\sim T}{\mathds{E}}\left(R(s,a)+\gamma\underset{a^{\prime}\in\mathcal{A}}{\text{max}}\,Q^{*}(s^{\prime},a^{\prime})\right). \tag{3}$$

On the other hand, given the optimal Q-function $Q^{*}$ one can represent the corresponding optimal policy as

$$\pi^{*}(a|s)=\frac{1}{\mathcal{N}}\begin{cases}1, & \text{if } a\in\underset{a^{\prime}\in\mathcal{A}}{\text{argmax}}\,Q^{*}(s,a^{\prime})\\ 0, & \text{else}\end{cases} \tag{4}$$

where $\mathcal{N}$ is a normalization factor assuring that all probabilities sum to unity.
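To make the fixed-point character of the Bellman optimality equation (3) and the greedy policy (4) concrete, the following is a minimal tabular sketch. The two-state, two-action MDP, its transition tensor `T`, its reward table `R`, and the helper names are hypothetical and chosen only for illustration; they are not part of the paper's experiments.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# T[s, a, s'] is the transition probability, R[s, a] the reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

def q_value_iteration(T, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate Q <- R + gamma * E_{s'}[max_a' Q(s', a')] until it stops changing."""
    Q = np.zeros_like(R)
    for _ in range(max_iter):
        Q_new = R + gamma * T @ Q.max(axis=1)  # right-hand side of Eq. (3)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

def greedy_policy(Q):
    """Uniform distribution over the maximizing actions, as in Eq. (4)."""
    is_max = np.isclose(Q, Q.max(axis=1, keepdims=True))
    return is_max / is_max.sum(axis=1, keepdims=True)  # division plays the role of 1/N

Q_star = q_value_iteration(T, R, gamma)
pi_star = greedy_policy(Q_star)
```

Tabular iteration like this is only feasible for small, fully known MDPs; the remainder of the paper replaces the table with a parameterized function approximator.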

II-B Online vs. Batch and On-Policy vs. Off-Policy RL

The task of learning an optimal policy can be approached in different ways. One fundamental distinction between RL algorithms is the kind of data used in the training process. If the learning algorithm interacts with the environment during the training process, it is called an online algorithm. However, this does not mean that the algorithm needs to be on-policy, i.e. exclusively use its current estimate of an optimal policy to gather data. Off-policy algorithms, on the other hand, often employ explorative policies to gather experience from the environment and train a separate policy that should eventually solve the given task. In contrast to this, batch or offline algorithms only need access to data that was collected from the environment beforehand. In principle, any off-policy learning algorithm can be used for batch RL, but it requires special care to avoid problems during training [10]. Many offline RL approaches rely on Q-learning with deep Q-networks (DQN), as explained in the next section.

II-C Deep Q-Learning

Q-learning relies on the fact that an optimal policy is induced by a Q-function that solves the Bellman equation (3). A popular approach for large-scale problems is DQN, which uses DNNs as parametrized function approximators for $Q^{*}$ [13]. They can be trained on a loss function that is derived from the Bellman equation (3):

$$l(\theta)=\underset{M}{\mathds{E}}\left(r+\gamma\underset{a^{\prime}\in\mathcal{A}}{\text{max}}\,Q_{\theta^{\prime}}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2} \tag{5}$$

where $M$ is the mini-batch sampled from buffer $\mathcal{B}$ consisting of transitions $(s,a,r,s^{\prime})$.
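The loss in equation (5) can be sketched for a single mini-batch as follows. This is an illustrative NumPy version that assumes the Q-values of the online and target networks have already been evaluated; the function name and array layout are assumptions, and, like equation (5) itself, the sketch does not treat terminal transitions separately.

```python
import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, gamma):
    """Mean squared temporal-difference error of Eq. (5) for one mini-batch.

    q_online:      (batch, n_actions) array with Q_theta(s, .)
    q_target_next: (batch, n_actions) array with Q_theta'(s', .)
    actions:       (batch,) integer actions a taken in s
    rewards:       (batch,) rewards r
    """
    # Bootstrapped target: r + gamma * max_a' Q_theta'(s', a')
    target = rewards + gamma * q_target_next.max(axis=1)
    # Q_theta(s, a) for the action that was actually taken
    predicted = q_online[np.arange(len(actions)), actions]
    return np.mean((target - predicted) ** 2)

# Tiny hand-checkable batch of two transitions.
example_loss = dqn_loss(
    q_online=np.array([[1.0, 2.0], [3.0, 4.0]]),
    q_target_next=np.array([[0.0, 1.0], [2.0, 0.0]]),
    actions=np.array([0, 1]),
    rewards=np.array([1.0, 1.0]),
    gamma=0.5,
)
```

In a training loop, only `q_online` would depend on the trainable parameters, while `q_target_next` comes from the periodically updated target parameters.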

II-D Variational Quantum Circuits for RL

VQCs have gained a lot of attention from the quantum computing community in recent years due to their NISQ feasibility. It has already been established that they are a potentially powerful platform for quantum-enhanced ML [14, 15, 16]. While any quantum computation can be decomposed into a sequence of quantum gates, the power of VQCs arises from the fact that some of these gates are parameterized by a continuous variable. For example, any single-qubit gate can be expressed as a rotation in 3D space acting on the Bloch vector and is therefore parameterized by the three Euler angles corresponding to this rotation. This allows us to use quantum circuits as parameterized function approximators. For a detailed introduction to the framework and theory of quantum computing, we refer the reader to [17].

A VQC that represents a function $f_{\theta}(x)$ is composed of three basic components [18]:

1. The data encoding. The unitary $U(x)$ encodes the input data $x$ into a quantum state $\ket{\psi(x)}=U(x)\ket{0}$.
2. The variational layers. A different unitary $U(\theta)$ maps the input state $\ket{\psi(x)}$ onto the output state $\ket{\phi(x,\theta)}=U(\theta)\ket{\psi(x)}$. A common approach is to decompose $U(\theta)$ into a set of repeated layers, which contain both parameterized rotations and entangling gates.
3. The measurement. An observable $O$ is chosen and its expectation value is estimated over several runs of the circuit.

The result represents the output of the VQC

$$f_{\theta}(x)=\bra{\phi(x,\theta)}O\ket{\phi(x,\theta)}. \tag{6}$$

It has been shown in [19, 20] that a given VQC actually realizes a specific Fourier sum

$$f_{\theta}(x)=\sum_{\omega\in\Omega}c_{\omega}(\theta)e^{i\omega x}, \tag{7}$$

where the available frequency spectrum $\Omega$ is determined by the data encoding. Repeating the data encoding, i.e. the action of the unitary $U(x)$, multiple times throughout the circuit (cf. section Data Re-Uploading) enlarges the available spectrum. Thus, a broader class of functions can be accessed through this so-called data re-uploading.
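The effect of re-uploading on the spectrum in equation (7) can be checked numerically on a toy model. The sketch below simulates a single qubit in which fixed, randomly drawn unitaries stand in for the trained variational layers and the input is encoded with an RX rotation; this single-qubit setup and all function names are illustrative assumptions, not the circuit used in this work. With $L$ encoding repetitions, only integer frequencies with $|\omega|\le L$ should carry non-zero Fourier coefficients.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rx(x):
    """Single-qubit rotation exp(-i x X / 2), used here as the data encoding."""
    return np.cos(x / 2) * np.eye(2) - 1j * np.sin(x / 2) * X

def random_unitary(rng):
    """Random 2x2 unitary from the QR decomposition of a complex Gaussian matrix."""
    a = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    q, _ = np.linalg.qr(a)
    return q

def circuit_output(x, blocks):
    """<Z> after alternating fixed blocks W_0, W_1, ... with encodings RX(x)."""
    state = blocks[0] @ np.array([1, 0], dtype=complex)
    for w in blocks[1:]:
        state = w @ (rx(x) @ state)
    return np.real(np.vdot(state, Z @ state))

def spectrum(n_uploads, n_samples=32, seed=0):
    """Magnitudes of the discrete Fourier coefficients of x -> f(x) on [0, 2*pi)."""
    rng = np.random.default_rng(seed)
    blocks = [random_unitary(rng) for _ in range(n_uploads + 1)]
    xs = 2 * np.pi * np.arange(n_samples) / n_samples
    values = np.array([circuit_output(x, blocks) for x in xs])
    return np.abs(np.fft.fft(values)) / n_samples

# Entry k of the result belongs to frequency k (or k - n_samples for k > n_samples / 2).
one_upload = spectrum(1)
three_uploads = spectrum(3)
```

Comparing `one_upload` and `three_uploads` shows that coefficients beyond frequency 1 vanish (up to floating-point error) in the first case, whereas frequencies up to 3 become available in the second.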

To take this analogy even further: Just like NNs, VQCs can be trained from a set of samples $\{(x,f(x))\}$ of the desired function. The optimization of the parameters $\theta$ in the training phase of the VQC is carried out on a classical computer, which may evaluate the VQC as a quantum subroutine. The hope is that VQCs are able to access a broader class of functions with fewer parameters and enable more data-efficient learning. Since the full action of the unitaries $U(x)$ and $U(\theta)$ is believed to be hard to simulate classically for appropriately designed VQCs, quantum computers might provide an advantage in machine learning tasks.

In the context of QRL, a common approach is to employ VQCs instead of NNs to represent the Q-function in DQN. The state of the environment is used as the input to the VQC. Each available action $a\in\mathcal{A}$ is assigned an observable $O_{a}$, such that the Q-function estimate is given by

$$Q_{\theta}(a|s)=\bra{\phi(s,\theta)}O_{a}\ket{\phi(s,\theta)}. \tag{8}$$

However, since the output of the VQC is the quantum mechanical expectation value of an observable (cf. equation (6)), the range of possible output values is quite limited. Popular choices for the observables $O_{a}$ are combinations of single-qubit Pauli matrices. This restricts the output of the VQCs to the interval $[-1,1]$. The true Q-function, however, might lie outside this range. To address this, one can simply scale each expectation value by a classical weight $w_{a}$, which is also inferred from the training process [21].

II-E Efficient Gradient Estimation on Quantum Devices

Since computing an explicit representation of the unitary $U(\theta)$ of the VQC is classically hard, one cannot efficiently compute the gradient of the VQC output with respect to the parameters $\theta$ with classical techniques. Although there is a way of obtaining the gradient directly from the quantum device with the parameter-shift rule [22], this approach still suffers from the fact that it requires $2p$ expectation value estimations for a circuit with $p$ parameters, which quickly becomes intractable even for medium-sized circuits due to the large number of required circuit runs. This is why gradient-free optimization schemes, like simultaneous perturbation stochastic approximation (SPSA), which always uses only two calls to the quantum subroutine for each update step, have become of interest to the QML community. SPSA is especially well suited for noisy objective functions, which arise when training VQCs [23, 24, 25]. It computes an approximate gradient that can be fed to state-of-the-art gradient-based optimizers like AMSGrad [26] to efficiently train medium-sized VQCs [27]. However, we suspect that this method introduces instabilities in the training of offline RL agents when no appropriate early-stopping criterion is available.
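The difference in evaluation cost between the two estimators can be illustrated with a short sketch. It assumes every parameter enters through a gate of the form $e^{-i\theta P/2}$ with a Pauli operator $P$, for which the parameter-shift rule with shifts of $\pm\pi/2$ is exact; the product of cosines used as a stand-in for a circuit expectation value, the perturbation size `c`, and the function names are illustrative choices.

```python
import numpy as np

def parameter_shift_grad(f, theta):
    """Gradient via the parameter-shift rule; needs 2p evaluations of f."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[i] = np.pi / 2
        grad[i] = 0.5 * (f(theta + shift) - f(theta - shift))
    return grad

def spsa_grad(f, theta, c=0.1, rng=None):
    """SPSA gradient estimate; needs 2 evaluations of f regardless of p."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)  # random +-1 perturbation
    diff = f(theta + c * delta) - f(theta - c * delta)
    # Dividing by delta equals multiplying by delta because its entries are +-1.
    return diff / (2 * c) * delta

def toy_expectation(theta):
    """Stand-in for a circuit output: <Z...Z> after RY(theta_i) on each qubit."""
    return np.prod(np.cos(theta))

theta = np.array([0.3, -1.2, 2.0])
exact_gradient = parameter_shift_grad(toy_expectation, theta)
noisy_gradient = spsa_grad(toy_expectation, theta, rng=np.random.default_rng(0))
```

Either estimate can then be handed to a gradient-based optimizer; the SPSA estimate is cheaper per step but considerably noisier.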

III Related Work

The following section starts with a short summary of QRL, followed by a survey of relevant batch RL literature.

III-A Quantum Reinforcement Learning

The use of quantum computing for RL is an emerging field. Dong et al. [28, 29] were among its pioneers, proposing an algorithm that learns a state-value function by using the Grover search algorithm [30]. More recently, Chen et al. [31] proposed one of the first VQC-based QRL algorithms, by replacing neural networks in the DQN algorithm with VQCs. For a detailed picture of QRL we refer to Meyer et al. [8]. Upon completion of our work, we came across the work of Cheng et al. [32], which uses VQCs in conservative Q-learning for offline RL. There the authors showed that a quantum agent can learn to solve an environment in an offline fashion. However, the authors did not investigate the performance of quantum agents on noisy data or partial trajectories.

III-B Batch Reinforcement Learning

When RL is utilized in scenarios where humans or equipment can be harmed, the agent is usually trained in a simulated environment beforehand. Consequently, the agent’s performance might suffer from limited modelling of real-world processes, and creating these simulations causes an additional overhead. To resolve this, the objective of batch RL is to learn an optimal policy from a set of real-world data. This training data is gathered by a behavior policy (e.g. an expert human operator) interacting with the environment. As an extreme case, one can even consider a behavior policy selecting actions at random. Off-policy algorithms such as DQN are similar in spirit to offline learning, since they are in principle agnostic to how the experience was gathered. However, Fujimoto et al. [33] found that Q-value estimates of DQN diverge in an offline setting, leading to the offline trained algorithms performing worse than the same algorithms trained on the same dataset in an online manner. Agarwal et al. [3] demonstrated that offline trained agents can reach comparable performance on the OpenAI Atari 2600 games [9], but a very large (50 million environment interactions) and diverse dataset was used. This can be explained by the fact that algorithms like DQN employ a replay buffer of past environment interactions, which causes the gathered data to be correlated to the current policy [13]. In offline RL, this correlation is not present and there is a distributional shift between training and testing. Furthermore, out-of-distribution states may arise if during testing the agent encounters a state that was not part of its training data.

Additionally, Q-learning-based algorithms like DQN tend to suffer from overestimating Q-values, due to the objective of maximizing the expected return [34]. In the online scenario, this is not as severe, due to corrective feedback from the environment during training. However, since this feedback is missing in an offline setting, this overestimation bias together with the distributional shift causes the policy to extrapolate poorly to unfamiliar states, a problem that Fujimoto et al. [33] refer to as extrapolation error. Approaches trying to alleviate this error typically introduce some constraint on the policy to keep it close to the behavior policy. Typically, closeness is determined either directly with respect to a probability metric or via penalty terms in the policy update [35].

III-C Batch-Constraint Deep Q-Learning

An algorithm implicitly following the first approach is batch-constraint deep Q-Learning (BCQ) [33]. The key idea is that in order to avoid the distributional shift, a trained policy should induce a similar state-action visitation to what is observed in the batch. In the following, such policies are called batch-constrained. To achieve this, BCQ uses a generative model $G_{\omega}$ to preselect likely actions according to the batch. The policy is only allowed to choose from this preselection. In the following, we will restrict the discussion of BCQ to the discrete action setting. In this case, the generative model can be understood as a map $G_{\omega}:\mathcal{S}\rightarrow\Delta\left(\mathcal{A}\right)$ that takes the current environment state as input and outputs the probability with which each action would occur in the batch. In particular, if the batch is filled using transitions from a policy $\pi_{b}$ then the generative model should reconstruct this policy, i.e. $G_{\omega}(a|s)\approx\pi_{b}(a|s)$. From this, one can preselect the actions by discarding actions whose probability relative to the most likely one is below a threshold $\tau$

$$\tilde{\mathcal{A}}(s)=\left\{a\in\mathcal{A}\,\middle|\,\frac{G_{\omega}(a|s)}{\underset{\hat{a}\in\mathcal{A}}{\text{max}}\,G_{\omega}(\hat{a}|s)}>\tau\right\}. \tag{9}$$

Both the policy

$$\pi_{\theta}(a|s)=\frac{1}{\mathcal{N}}\begin{cases}1, & \text{if } a\in\underset{a^{\prime}\in\tilde{\mathcal{A}}(s)}{\text{argmax}}\,Q_{\theta}(s,a^{\prime})\\ 0, & \text{else}\end{cases}$$

and the target in the loss function

$$l(\theta)=\underset{(s,a,r,s^{\prime})\in\mathcal{B}}{\mathds{E}}\left(r+\gamma\underset{a^{\prime}\in\tilde{\mathcal{A}}(s^{\prime})}{\text{max}}\,Q_{\theta^{\prime}}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2}$$

are updated to only consider this preselection of actions. The generative model itself is trained with a standard cross-entropy loss

$$l(\omega)=-\sum_{(s,a)\in\mathcal{B}}\text{log}\left(G_{\omega}(a|s)\right).$$
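The preselection of equation (9), the constrained greedy policy, and the cross-entropy loss of the generative model can be sketched in a few lines of NumPy. The function names and the array layout (one row per state, one column per action) are assumptions made for this illustration; the generative model's output probabilities are taken as given.

```python
import numpy as np

def allowed_actions(g_probs, tau):
    """Boolean mask of Eq. (9): keep a if G(a|s) / max_a' G(a'|s) > tau."""
    return g_probs / g_probs.max(axis=1, keepdims=True) > tau

def constrained_greedy(q_values, mask):
    """Greedy action restricted to the preselected set of each state."""
    return np.where(mask, q_values, -np.inf).argmax(axis=1)

def generative_loss(g_probs, actions):
    """Cross-entropy loss: negative log-likelihood of the batch actions."""
    return -np.sum(np.log(g_probs[np.arange(len(actions)), actions]))

# Two states, two actions. In the first state, action 1 is rare in the batch.
g_probs = np.array([[0.9, 0.1],
                    [0.5, 0.5]])
q_values = np.array([[0.0, 5.0],
                     [1.0, 2.0]])
mask = allowed_actions(g_probs, tau=0.3)
chosen = constrained_greedy(q_values, mask)
```

In the first state the unconstrained greedy choice would be action 1, but it is discarded by the mask because the batch rarely contains it. For any $\tau<1$ the most likely action always survives the preselection, so the constrained argmax is well defined.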

Additionally, to address the overestimation bias of Q-learning towards underrepresented transitions, a technique called Double DQN [36] is employed. Instead of selecting the maximal action with respect to the target network in the Q-learning target, the maximal action with respect to the current Q-network is chosen, but it is still evaluated using the target network. The corresponding loss function is

$$l(\theta)=\underset{(s,a,r,s^{\prime})\in\mathcal{B}}{\mathds{E}}\left(r+\gamma Q_{\theta^{\prime}}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2} \tag{10}$$

where

$$a^{\prime}\in\underset{\tilde{a}\in\tilde{\mathcal{A}}(s^{\prime})}{\text{argmax}}\,Q_{\theta}(s^{\prime},\tilde{a}).$$
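A batch-wise sketch of the loss in equation (10) is given below. It assumes the Q-values of the current and target networks and the preselection mask for the next states have been computed beforehand; the function and argument names are hypothetical, and terminal transitions are not handled separately, mirroring equation (10).

```python
import numpy as np

def bcq_double_dqn_loss(q_online, q_online_next, q_target_next,
                        mask_next, actions, rewards, gamma):
    """Mean squared error of Eq. (10) for one batch of transitions.

    q_online:      (batch, n_actions) Q_theta(s, .)
    q_online_next: (batch, n_actions) Q_theta(s', .), used to select a'
    q_target_next: (batch, n_actions) Q_theta'(s', .), used to evaluate a'
    mask_next:     (batch, n_actions) boolean preselection for s'
    """
    idx = np.arange(len(actions))
    # Select a' with the current network, restricted to preselected actions.
    a_next = np.where(mask_next, q_online_next, -np.inf).argmax(axis=1)
    # Evaluate the selected action with the target network.
    target = rewards + gamma * q_target_next[idx, a_next]
    return np.mean((target - q_online[idx, actions]) ** 2)

# Tiny hand-checkable batch of two transitions.
example_loss = bcq_double_dqn_loss(
    q_online=np.array([[1.0, 2.0], [3.0, 4.0]]),
    q_online_next=np.array([[0.0, 10.0], [5.0, 1.0]]),
    q_target_next=np.array([[2.0, 7.0], [4.0, 9.0]]),
    mask_next=np.array([[True, False], [True, True]]),
    actions=np.array([0, 1]),
    rewards=np.array([1.0, 1.0]),
    gamma=0.5,
)
```

Separating action selection from action evaluation in this way is what distinguishes the Double DQN target from the plain maximum over the target network.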

In this work, we apply the variational quantum deep Q-networks (VQ-DQN) proposed by [37] to get an offline QRL algorithm which we call batch-constraint quantum Q-learning (BCQQ).

IV BCQQ

Classical batch RL methods like discrete BCQ often struggle to learn an optimal policy in scenarios where high-quality training samples are scarce. Loosely interpreted, this behavior suggests that function approximation via classical neural networks requires a large amount of data to effectively approximate a policy. However, VQCs have shown some indication of approximating a function from far fewer samples in the supervised learning context [7]. Therefore, we designed and conducted a series of experiments to study the capabilities of VQCs in learning a policy in the batch RL setup.

IV-A RL Environment and Offline Data Collection

The capabilities of a VQC in learning an optimal policy in an online fashion to solve CartPole environments have already been studied in various instances [21, 37]. Therefore, we chose the CartPole-v1 environment from the OpenAI gym [9] as the target environment for all the performed experiments. Offline datasets for real-world problems often do not contain trajectories from the optimal policy for a given setup. To mimic the worst-case scenario, we experimented with the most adverse situation, where the data buffer contains trajectories only from a random policy interacting with the CartPole-v1 environment. Buffers with $10^{2}$, $10^{4}$, and $10^{6}$ samples were collected using this setup. Additionally, we study how well the offline agents are able to learn from an expert policy. To avoid pure imitation, the trajectories gathered by the expert were artificially corrupted with noise by letting the expert choose a random action with a low probability. Buffers with 100, 50, and 25 samples were collected using the noisy-expert policy for this experiment.
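The two data-collection modes described above can be sketched as follows. The `reset` and `step` callables are a hypothetical, simplified environment interface (not the OpenAI gym API), and the noise probability `eps` is a placeholder, since the text only states that it is low.

```python
import numpy as np

def noisy_expert_action(expert_action, n_actions, eps, rng):
    """Return the expert's action, or a uniformly random one with probability eps."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(expert_action)

def collect_buffer(reset, step, policy, n_samples):
    """Collect n_samples transitions (s, a, r, s', done) with the given policy.

    reset() -> state and step(action) -> (next_state, reward, done) form a
    hypothetical environment interface used only for this sketch.
    """
    buffer = []
    state = reset()
    while len(buffer) < n_samples:
        action = policy(state)
        next_state, reward, done = step(action)
        buffer.append((state, action, reward, next_state, done))
        # Start a new episode once the current one has terminated.
        state = reset() if done else next_state
    return buffer
```

A random-policy buffer corresponds to `policy = lambda s: int(rng.integers(2))`, while a noisy-expert buffer wraps a trained agent's greedy action in `noisy_expert_action`.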

IV-B Variational Quantum Circuit

Figure 1: The VQC that is used as the function approximator for the BCQQ algorithm. Note: Each $\vec{\theta}$ block represents the repetition of the variational layer ansatz with different trainable parameters.

The VQC used as the function approximator in BCQQ is shown in Fig. 1. A four-qubit quantum system was chosen as the target system because the CartPole-v1 environment has a four-dimensional state space. Each feature of the observation is encoded into the VQC using a single-qubit $R_x$ gate on its own qubit. The variational block comprises five layers, each containing four parameterized $R_y$ gates and four parameterized $R_z$ gates. In addition to the parameterized rotation gates, each layer also includes two-qubit CZ entangling gates with nearest-neighbor connectivity. We chose nearest-neighbor connectivity because it is one of the most commonly available quantum hardware topologies. CartPole-v1 has an action space of size two. Therefore, the expectation values of the Pauli-$ZZ$ observable on qubits 1 and 2 and of the Pauli-$ZZ$ observable on qubits 3 and 4 were used to decode the Q-values from the VQC. Note that the encoding scheme, the VQC ansatz, and the decoding scheme are simple design choices based on previous works [37, 38]. Different combinations of parameter-shift and SPSA-based gradient estimators along with the Adam and AMSGrad optimizers were tested for optimizing the trainable parameters.
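To make the circuit structure concrete, the following is a small NumPy statevector sketch of a circuit of this kind, without data re-uploading. Several details are assumptions made for illustration: qubits are zero-indexed, the CZ gates form a linear chain, each layer applies $R_y$, then $R_z$, then the CZ gates, and the raw $ZZ$ expectation values (which lie in $[-1, 1]$) are returned without any output rescaling. It is not the authors' Qiskit implementation.

```python
import numpy as np

N_QUBITS = 4

def rx(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

def apply_1q(state, gate, q):
    """Apply a single-qubit gate to qubit q of a state tensor of shape (2,)*N."""
    state = np.tensordot(gate, state, axes=([1], [q]))
    return np.moveaxis(state, 0, q)

def apply_cz(state, q1, q2):
    """Flip the sign of all amplitudes in which both qubits are |1>."""
    state = state.copy()
    idx = [slice(None)] * N_QUBITS
    idx[q1], idx[q2] = 1, 1
    state[tuple(idx)] *= -1
    return state

def zz_expectation(state, q1, q2):
    """Expectation value of Z (x) Z on qubits q1 and q2."""
    probs = np.abs(state) ** 2
    z = np.array([1.0, -1.0])
    shape1, shape2 = [1] * N_QUBITS, [1] * N_QUBITS
    shape1[q1], shape2[q2] = 2, 2
    return float(np.sum(probs * z.reshape(shape1) * z.reshape(shape2)))

def q_values(x, params):
    """x: (4,) encoded features; params: (n_layers, 2, 4) angles for Ry and Rz."""
    state = np.zeros((2,) * N_QUBITS, dtype=complex)
    state[(0,) * N_QUBITS] = 1.0
    for q in range(N_QUBITS):              # encoding layer
        state = apply_1q(state, rx(x[q]), q)
    for layer in params:                   # variational layers
        for q in range(N_QUBITS):
            state = apply_1q(state, ry(layer[0, q]), q)
            state = apply_1q(state, rz(layer[1, q]), q)
        for q in range(N_QUBITS - 1):      # assumed linear CZ chain
            state = apply_cz(state, q, q + 1)
    # One ZZ observable per action.
    return np.array([zz_expectation(state, 0, 1), zz_expectation(state, 2, 3)])
```

Calling `q_values(x, params)` with `params` of shape `(5, 2, 4)` mirrors the five variational layers with four $R_y$ and four $R_z$ angles each.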

IV-B1 Data Re-Uploading

Data re-uploading [39] is an encoding strategy in which the encoding scheme is repeated throughout the VQC. The ordering of the encoding scheme and the trainable layers in a standard VQC is shown in Fig. 1. Schuld et al. [22] show that the more encoding layers a VQC contains, the larger the frequency spectrum it can capture. Hence, the encoding scheme was re-introduced before every variational layer. Fig. 2 shows the circuit generated using the data re-uploading method.

[Circuit diagram: each of the four qubits starts in $\ket{0}$. An encoding layer applies $R_x(x_0)$, $R_x(x_1)$, $R_x(x_2)$, and $R_x(x_3)$ to the four qubits and is followed by a variational block $\vec{\theta}$. This pair of encoding layer and variational block is then repeated throughout the circuit; the repetitions are labeled DRU.]
Figure 2: Quantum agent with standard data re-uploading strategy

IV-B2 Cyclic Data Re-Uploading

It has been established that spreading the encoding gates for the feature vector of a given data point throughout the quantum circuit results in a better representation of the data [40]. Inspired by this, we decided to expose each qubit to all the features of the current input state. We achieve this by slightly modifying the data re-uploading strategy explained in Section IV-B1. In contrast to the standard approach, each time the encoding scheme is re-introduced, the input feature vector is shifted by one step in a round-robin fashion. We therefore call this type of data re-uploading "cyclic data re-uploading". To the best of our knowledge, this type of encoding scheme has not been explored in the literature before. Fig. 3 shows the circuit generated using the cyclic data re-uploading method.
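The difference between the two re-uploading variants comes down to which feature each qubit receives in each encoding layer. The helper below is an illustrative sketch: the function name is hypothetical, and the direction of the shift (qubit 0 sees $x_0$, then $x_1$, and so on) is an assumption, since the text only specifies a one-step round-robin shift per layer.

```python
import numpy as np

def encoding_angles(x, n_layers, cyclic=True):
    """Return an (n_layers, n_qubits) array of R_x angles.

    Row l holds the feature encoded on each qubit in encoding layer l.
    With cyclic=False every layer repeats the same feature order
    (standard data re-uploading).
    """
    x = np.asarray(x)
    # Assumed shift direction: one position to the left per layer.
    return np.stack([np.roll(x, -l) if cyclic else x for l in range(n_layers)])
```

For `x = [x0, x1, x2, x3]` and five layers, the first column of the cyclic variant reads `x0, x1, x2, x3, x0`, so every qubit is exposed to every feature.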

Figure 3: Quantum agent with cyclic data re-uploading strategy

IV-C Discrete Batch-Constraint Quantum Q-Learning

As explained in the sections Introduction and Theoretical Background, this work aims to study the advantages gained by using VQCs as function approximators in the discrete BCQ algorithm to learn an optimal policy for solving a given environment. To this end, we replaced the generative model $G_{\omega}$ and the model approximating the optimal policy with two trainable VQCs, as explained in the section Variational Quantum Circuits for RL. Since we encode data via single-qubit rotation gates, each entry of the observation vectors in the dataset is normalized using the encoding scheme presented by [37]. The overall discrete batch-constraint quantum Q-learning algorithm is summarized in Algorithm 1. All the experiments in this study were conducted using the following common hyper-parameters: discount factor $\gamma=0.99$, threshold $\tau=0.3$, and a mini-batch size of 32. Three different learning rates $\alpha\in\{0.01, 0.001, 0.0003\}$ were tested for each experiment, as the classical and quantum models might need different learning rates for optimal learning on similar problem setups.
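Since the observations are fed into rotation gates as angles, they have to be mapped to a bounded interval. The exact scheme of [37] is not reproduced in this section, so the following is only a plausible stand-in: a per-feature min-max rescaling to $[-\pi,\pi]$, with bounds taken from the dataset unless given explicitly. The function name and the handling of constant features are assumptions.

```python
import numpy as np

def scale_to_pi(observations, low=None, high=None):
    """Rescale each feature column linearly to the interval [-pi, pi]."""
    obs = np.asarray(observations, dtype=float)
    low = obs.min(axis=0) if low is None else np.asarray(low, dtype=float)
    high = obs.max(axis=0) if high is None else np.asarray(high, dtype=float)
    # Constant features would give a zero span; use 1.0 to avoid dividing by zero.
    span = np.where(high > low, high - low, 1.0)
    return -np.pi + 2 * np.pi * (obs - low) / span
```

Passing fixed `low` and `high` bounds keeps the mapping identical between the offline dataset and the states seen during validation.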

Algorithm 1 Discrete BCQQ training algorithm
  Normalize dataset $\mathcal{B}$ between $[-\pi,\pi]$
  Initialize encoding unitary $U(\cdot)$
  Initialize Q value approximator using VQC with $\theta$ and $\theta^{\prime}$
  Initialize generative model $G_{\omega}$ using VQC with $\omega$
  while training not converged do
     Sample mini-batch $M$ from $\mathcal{B}$
     for all $(s,a,r,s^{\prime})\in M$ do
        Get batch-constraint actions $\tilde{\mathcal{A}}(s^{\prime})$ from (9)
        Collect $a^{\prime}\in\underset{\tilde{a}\in\tilde{\mathcal{A}}(s^{\prime})}{\text{argmax}}\,Q_{\theta}(|\psi(s^{\prime})\rangle,\tilde{a})$
     end for
     Optimize $\theta$ w.r.t. $l(\theta)=\underset{M}{\mathbb{E}}\left(r+\gamma Q_{\theta^{\prime}}(|\psi(s^{\prime})\rangle,a^{\prime})-Q_{\theta}(|\psi(s)\rangle,a)\right)^{2}$
     Optimize $\omega$ w.r.t. $l(\omega)=-\underset{M}{\mathbb{E}}\log\left(G_{\omega}(a,|\psi(s)\rangle)\right)$
     if target VQC update then
        $\theta^{\prime}\leftarrow\theta$
     end if
  end while
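The last two steps of Algorithm 1, the loss of the generative model and the target update, can be sketched in NumPy as below. The probabilities $G_{\omega}(a,|\psi(s)\rangle)$ are treated as given inputs, because how they are read out from the VQC is not covered here. The update interval `every` is hypothetical, as the algorithm only states that the target VQC is updated at some condition.

```python
import numpy as np

def behavior_nll(g_probs, actions, eps=1e-12):
    """l(omega) = -E_M log G_omega(a, s), the negative log-likelihood of the
    actions stored in the mini-batch.

    g_probs: (B, A) action probabilities predicted by the generative model
    actions: (B,) integer actions from the buffer
    """
    picked = g_probs[np.arange(len(actions)), actions]
    # eps guards against log(0) when a probability is exactly zero.
    return float(-np.mean(np.log(picked + eps)))

def maybe_sync_target(theta, theta_target, step, every):
    """Hard target update theta' <- theta on every `every`-th step (assumed schedule)."""
    return theta.copy() if step % every == 0 else theta_target
```

Minimizing `behavior_nll` pushes the generative model towards the action distribution of the data-collecting policy, which is what the threshold $\tau$ later acts on.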
Figure 4: Figure (a) shows the eigenvalue spectrum of average FIM for the classical model plotted as a histogram with normalized counts. Figure (b) shows the eigenvalue spectrum of average FIM for quantum models plotted as a histogram with normalized counts. Figure (c) shows the effective dimension results for both classical and quantum models. The FIM is calculated using 500 data points sampled from the CartPole-v1 states and 100 random parameter sets.

IV-D Model selection

To understand the difference between cyclic and standard data re-uploading, we analyze the effective dimension of the resulting models. The effective dimension captures the expressivity of an ML model and is based on the empirical FIM [41, 42, 43]. Intuitively, it quantifies the range of different functions a given model can approximate. Furthermore, the eigenvalue spectrum of the FIM gives insights into the geometry of the model's parameter space and hence the model's trainability. Fig. 4 shows both the effective dimension and the eigenvalue spectrum of the FIM for the quantum and classical models using CartPole-v1 states as input. The cyclic data re-uploading strategy results in a slight increase in the effective dimension, even though the eigenvalues of the FIM are more or less uniformly distributed in all quantum models. Therefore, the cyclic data re-uploading strategy was chosen as the quantum model for the experiments. The effective dimension of the classical model was calculated only for a network with 50 parameters and not for larger networks, because of the memory bottleneck in calculating the FIM for larger networks. Classical networks with a larger number of parameters are nevertheless considered in the experiments. Details regarding the network architectures are explained in the later sections.
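As a rough illustration of how an averaged FIM eigenvalue spectrum like the one in Fig. 4 can be computed, the sketch below uses a toy softmax-linear model as a stand-in for the VQC or DNN. The model, the uniform parameter range $[-1,1]$, and all function names are assumptions. Only the spectrum is computed, not the effective dimension itself.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def empirical_fim(theta, xs, rng):
    """Empirical FIM of the toy model p(y | x) = softmax(theta @ x).

    Labels y are sampled from the model itself, and the FIM is the average
    outer product of the score vectors d log p(y | x) / d theta.
    """
    n_classes = theta.shape[0]
    fim = np.zeros((theta.size, theta.size))
    for x in xs:
        p = softmax(theta @ x)
        y = rng.choice(n_classes, p=p)
        grad = np.outer(np.eye(n_classes)[y] - p, x).ravel()
        fim += np.outer(grad, grad)
    return fim / len(xs)

def average_fim_spectrum(xs, n_param_sets, shape, rng):
    """Eigenvalues (ascending) of the FIM averaged over random parameter sets."""
    fims = [empirical_fim(rng.uniform(-1, 1, size=shape), xs, rng)
            for _ in range(n_param_sets)]
    return np.linalg.eigvalsh(np.mean(fims, axis=0))
```

In the setting of Fig. 4, `xs` would hold 500 CartPole-v1 states and `n_param_sets` would be 100; a histogram of the returned eigenvalues then gives the plotted spectrum.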

V Results and Discussion

V-A Training on Random Trajectory

The VQC presented in Fig. 3 was used as the quantum agent in the discrete BCQQ algorithm to find an optimal policy that solves the CartPole-v1 environment from a buffer filled with random environment interactions. Table I presents the cumulative validation reward averaged over three training runs, each performed for a maximum of 25000 gradient update steps. In addition, training was stopped when the agent reached a cumulative reward of 500 in all ten validation environments to reduce the computational cost of the experiments. All experiments were performed on a quantum simulator using the Qiskit API [44]. SPSA-based gradient estimation along with the AMSGrad optimizer and a learning rate of 0.01 resulted in the best average reward for the quantum agent. Table I shows that the quantum agent with the cyclic data re-uploading strategy is able to learn an optimal policy for all buffer sizes, even from just 100 random interactions.

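| Function Approx. | No. of Params. | Buffer Size | Average Reward |
|---|---|---|---|
| Quantum Agent with Cyclic DRU | 42 × 2 | 1e6 | 500 |
| Quantum Agent with Cyclic DRU | 42 × 2 | 1e4 | 500 |
| Quantum Agent with Cyclic DRU | 42 × 2 | 1e2 | 500 |
| Classical Neural Network | 67586 × 2 | 1e6 | 500 |
| Classical Neural Network | 67586 × 2 | 1e4 | 348.55 |
| Classical Neural Network | 67586 × 2 | 1e2 | 69.33 |
| Classical Neural Network | 67 × 2 | 1e6 | 342.22 |
| Classical Neural Network | 67 × 2 | 1e4 | 336.33 |
| Classical Neural Network | 67 × 2 | 1e2 | 12.88 |

Table I: Average cumulative reward returned by quantum and classical agents. All results are averaged over three training runs.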
Figure 5: Figures (a), (b), and (c) show the learning curves of the quantum agent with the cyclic data re-uploading strategy and of different classical agents trained on partial noisy trajectories of length 25, 50, and 100, respectively. The results shown are averaged over 3 training runs, with each evaluation consisting of rewards averaged over 10 random seeds.

V-A1 Classical Benchmark

To benchmark the performance of VQCs against classical algorithms, we trained two different classical DNNs using the discrete BCQ algorithm on the datasets described in Section IV-A. We chose a fully connected architecture with two hidden layers of 256 or 5 nodes, respectively, and ReLU activation. Both networks were trained for a maximum of 100000 steps, instead of 25000 steps as in discrete BCQQ training. This increase was motivated by the fact that these classical DNNs can be updated at a lower computational cost per update step. However, the early stopping criterion is still used so that the experimental setup remains comparable to the previous one. For the classical networks, the Adam optimizer with a learning rate of 0.01 resulted in the highest average reward. The results presented in Table I show that the classical agent trained with discrete BCQ has difficulty learning an optimal policy from small buffers generated using a random policy. Only the large NN was able to learn an optimal policy, and only in the one-million-sample case. Even this larger network was unable to converge to an optimal solution when the size of the training buffer was reduced.

V-A2 Globality Testing

Table I shows that the VQC with cyclic data re-uploading trained with the discrete BCQQ algorithm can learn a policy that attains a cumulative reward of at least 500 in the CartPole-v1 environment in all cases. However, it is also interesting to see the maximum cumulative reward beyond 500 that a trained agent can achieve in the CartPole-v1 environment. To this end, we tested the trained agents from the random buffer experiments in the CartPole-v1 environment until either the environment terminated or a cumulative reward of 100000 was achieved. The results of this globality test are shown in Fig. 6. They show that the VQC with cyclic data re-uploading not only learns to solve the CartPole-v1 environment, but also learns a more stable policy than all other agents.

Figure 6: Globality test results in solving the CartPole-v1 environment by quantum and classical agents. The quantum agent employs the cyclic data re-uploading strategy and $42\times 2$ trainable parameters. The classical agent consists of $67586\times 2$ trainable parameters.

V-B Training on Partial Noisy Trajectory

Results from the above sections are useful to estimate the capabilities of a VQC with cyclic data re-uploading as a quantum agent. However, in real-world scenarios, offline RL algorithms are most useful in problems where an RL environment is not available or agent-environment interactions during training are not possible. Here, the agent should be able to learn and hold on to an optimal policy from the clean (or noisy) data collected using an expert policy alone. For this purpose, we collected buffers of 100, 50, and 25 environment transitions, based on a pre-trained expert policy. We compare the performance of a VQC with cyclic data re-uploading against four classical DNNs consisting of 2 hidden layers with 4, 18, 50, and 100 nodes each. Furthermore, to check whether the agent is able to maintain the learned policy, the early stopping condition was not used here. The results from these experiments are presented in Fig. 5. The quantum agents trained on buffers of size 25 and 100 exhibited stable learning behavior with a learning rate of 0.001, while the agent trained with a buffer of size 50 showed stable behavior with a learning rate of 0.01. The parameter-shift rule with the Adam optimizer was better suited for the quantum agents, as agents trained with SPSA-based gradients could not hold on to the learned optimal policy. All the classical agents, in contrast, showed stable learning behavior with the same optimizer but a learning rate of 0.0003. All the graphs show that the quantum agent was able to successfully learn to solve the CartPole-v1 environment from partial noisy trajectories. The numerical results indicate that the quantum agent exhibits convergence and performance similar to classical agents with 50 to 250 times more parameters in solving the CartPole-v1 environment. The classical agent with a comparable number of parameters struggled to learn an optimal policy when the buffer size was reduced.
Our current results indicate a slight advantage of BCQQ in terms of the amount of data and the number of trainable parameters required. Nonetheless, additional experiments with more complex environments, advanced classical networks, and larger VQCs (in terms of the number of qubits and the number of parameters) are needed to further substantiate the potential of BCQQ over the BCQ method. This is beyond the scope of this work and left for the future.

V-C Validation on real quantum hardware

All the experimental results presented in the above sections illustrate that the VQC can learn an optimal policy for solving the CartPole-v1 environment with just 100 samples. However, all the experiments presented so far were performed using an ideal quantum simulator that computes exact expectation values. In contrast, the NISQ devices that are currently available are prone to hardware noise and can only estimate expectation values: a given quantum circuit is executed repeatedly, and the expectation value is estimated from the measurement outcomes. The number of repetitions of the circuit is commonly known as the number of shots. The higher the number of shots, the more accurate the estimated expectation values. To analyze the minimum number of shots required by the agent and the performance of the quantum agent under the influence of hardware noise, we tested the best-performing quantum agent from Section V-B on a noisy simulator and on real quantum hardware. The performance of the agent using different numbers of shots is shown in Fig. 7.
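The effect of the shot count can be illustrated without any quantum hardware. For an observable with outcomes $\pm 1$, such as the Pauli-$ZZ$ observables used for decoding, each shot yields $+1$ with probability $(1+\langle O\rangle)/2$. The sketch below samples this distribution; it models only sampling noise, not hardware noise, and the function names are illustrative.

```python
import numpy as np

def estimate_expectation(true_value, shots, rng):
    """Estimate <O> of a +/-1-valued observable from `shots` measurement outcomes."""
    p_plus = (1.0 + true_value) / 2.0      # probability of measuring +1
    n_plus = rng.binomial(shots, p_plus)   # number of +1 outcomes
    return (2 * n_plus - shots) / shots

def standard_error(true_value, shots):
    """Standard deviation of the shot-based estimate around the true value."""
    return np.sqrt((1.0 - true_value ** 2) / shots)
```

The standard error shrinks as $1/\sqrt{\text{shots}}$, so going from 32 to 128 shots halves the sampling error of each estimated Q-value.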

Figure 7: Validation results of a quantum agent trained on an ideal simulator and tested on the noisy simulator, quantum hardware, and quantum hardware with zero noise extrapolation type error mitigation.

Validation results with the noisy simulator show that the agent struggles to solve the CartPole-v1 environment using expectation values estimated with 32 or 64 shots. The agent achieved its peak performance with 128 shots, and further increasing the number of shots did not improve performance, because the impact of noise outweighed the accuracy gained from the higher number of shots. Further, we also validated the performance of the quantum agent using the "ibmq_mumbai", "ibmq_kolkata", and "ibm_algiers" devices. Due to the high computational cost of validation and the scarce availability of quantum resources, we limited this validation to 64, 256, and 1024 shots. The validation results on the IBMQ devices show that the agent was only able to achieve a reward of at most 90 due to the significant impact of noise. Nonetheless, the agent was able to achieve a high reward of 495 with the aid of the zero noise extrapolation error mitigation technique [45] and 1024 shots. We speculate that the drop in performance on the IBMQ devices could be overcome with further training on the real device. However, further training on a NISQ device is beyond the scope of this work and left for the future.
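The idea behind zero noise extrapolation is to measure the same expectation value at several deliberately amplified noise levels and extrapolate back to the zero-noise limit. The sketch below uses a polynomial fit in the noise scale factor. The scale factors, the fit degree, and the function name are assumptions, since the configuration used in the experiments is not stated here.

```python
import numpy as np

def zero_noise_extrapolate(scale_factors, noisy_values, degree=1):
    """Fit a polynomial in the noise scale factor and evaluate it at zero noise.

    scale_factors: noise amplification factors, e.g. [1, 2, 3] (1 = unmodified circuit)
    noisy_values:  expectation values measured at those factors
    """
    coeffs = np.polyfit(scale_factors, noisy_values, deg=degree)
    return float(np.polyval(coeffs, 0.0))
```

When `degree` equals the number of scale factors minus one, the polynomial passes through all measured points, which corresponds to Richardson-style extrapolation; a lower degree trades bias for robustness against shot noise.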

VI Conclusion

In this paper, we propose discrete BCQQ, a batch RL algorithm for discrete action spaces based on VQCs. The key component is a cyclic data re-uploading scheme, where the encoding layers are not only repeated sequentially throughout the circuit, but the assignment of input features to qubits is also cyclically shifted from layer to layer. This leads to a slight increase in the model's effective dimension, improving its expressivity without increasing the number of model parameters. Experiments in the CartPole-v1 environment demonstrate that discrete BCQQ can learn offline from partial trajectories. The results also indicate that the BCQQ algorithm may show better generalization and learning capabilities than classical networks when learning from small datasets. For future work, we want to investigate how discrete BCQQ scales with more complex VQCs and evaluate its performance in challenging environments. In addition, we expect that cyclic data re-uploading may lead to improvements in other quantum machine learning tasks, such as classification, but leave this for future work.

VII Acknowledgement

We thank Dr. Georgios Kontes (Fraunhofer IIS, Nürnberg, Germany), Dr. Steffen Udluft (Siemens AG Technology, Munich, Germany), and Dr. Daniel Hein (Siemens AG Technology, Munich, Germany) for helpful discussions about reinforcement learning. This work was supported by the German Federal Ministry of Education and Research (BMBF), funding program “quantum technologies from basic research to market”, grant number 13N15645.

References

  • [1] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
  • [2] A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R Ruiz, J. Schrittwieser, G. Swirszcz et al., “Discovering faster matrix multiplication algorithms with reinforcement learning,” Nature, vol. 610, no. 7930, pp. 47–53, 2022.
  • [3] R. Agarwal, D. Schuurmans, and M. Norouzi, “Striving for simplicity in off-policy deep reinforcement learning,” 2020. [Online]. Available: https://openreview.net/forum?id=ryeUg0VFwr
  • [4] O. Kiss, M. Grossi, P. Lougovski, F. Sanchez, S. Vallecorsa, and T. Papenbrock, “Quantum computing of the $^{6}$Li nucleus via ordered unitary coupled clusters,” Physical Review C, vol. 106, no. 3, p. 034325, 2022.
  • [5] P. J. O’Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding et al., “Scalable quantum simulation of molecular energies,” Physical Review X, vol. 6, no. 3, p. 031007, 2016.
  • [6] W. J. Yun, J. P. Kim, S. Jung, J.-H. Kim, and J. Kim, “Quantum multi-agent actor-critic neural networks for internet-connected multi-robot coordination in smart factory management,” IEEE Internet of Things Journal, 2023.
  • [7] M. C. Caro, H. Huang, M. Cerezo et al., “Generalization in quantum machine learning from few training data,” Nature Communications, vol. 13, p. 4919, 2022.
  • [8] N. Meyer, C. Ufrecht, M. Periyasamy, D. D. Scherer, A. Plinge, and C. Mutschler, “A survey on quantum reinforcement learning,” arXiv preprint arXiv:2211.03464, 2022.
  • [9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” CoRR, vol. abs/1606.01540, 2016. [Online]. Available: http://arxiv.org/abs/1606.01540
  • [10] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau, “Benchmarking batch deep reinforcement learning algorithms,” arXiv preprint arXiv:1910.01708, 2019.
  • [11] M. Van Otterlo and M. Wiering, “Reinforcement learning and markov decision processes,” Reinforcement learning: State-of-the-art, pp. 3–42, 2012.
  • [12] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [14] M. Schuld, A. Bocharov, K. M. Svore, and N. Wiebe, “Circuit-centric quantum classifiers,” Physical Review A, vol. 101, no. 3, p. 032308, 2020.
  • [15] S. Y.-C. Chen, C.-M. Huang, C.-W. Hsing, and Y.-J. Kao, “Hybrid quantum-classical classifier based on tensor network and variational quantum circuit,” arXiv preprint arXiv:2011.14651, 2020.
  • [16] A. Blance and M. Spannowsky, “Quantum machine learning for particle physics using a variational quantum classifier,” Journal of High Energy Physics, vol. 2021, no. 2, pp. 1–20, 2021.
  • [17] M. A. Nielsen and I. L. Chuang, Quantum computation and quantum information.   Cambridge university press, 2010.
  • [18] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, “Quantum circuit learning,” Physical Review A, vol. 98, no. 3, p. 032309, 2018.
  • [19] M. Schuld, R. Sweke, and J. J. Meyer, “Effect of data encoding on the expressive power of variational quantum-machine-learning models,” Physical Review A, vol. 103, no. 3, p. 032430, 2021.
  • [20] F. J. Gil Vidal and D. O. Theis, “Input redundancy for parameterized quantum circuits,” Frontiers in Physics, vol. 8, p. 297, 2020.
  • [21] A. Skolik, S. Jerbi, and V. Dunjko, “Quantum agents in the gym: a variational quantum algorithm for deep q-learning,” Quantum, vol. 6, p. 720, 2022.
  • [22] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, “Evaluating analytic gradients on quantum hardware,” Phys. Rev. A, vol. 99, p. 032331, Mar 2019. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.99.032331
  • [23] A. Pellow-Jarman, I. Sinayskiy, A. Pillay, and F. Petruccione, “A comparison of various classical optimizers for a variational quantum linear solver,” Quantum Information Processing, vol. 20, no. 6, p. 202, 2021.
  • [24] X. Bonet-Monroig, H. Wang, D. Vermetten, B. Senjean, C. Moussa, T. Bäck, V. Dunjko, and T. E. O’Brien, “Performance comparison of optimization methods on variational quantum algorithms,” arXiv preprint arXiv:2111.13454, 2021.
  • [25] I. Miháliková, M. Friák, M. Pivoluska, M. Plesch, M. Saip, and M. Šob, “Best-practice aspects of quantum-computer calculations: A case study of the hydrogen molecule,” Molecules, vol. 27, no. 3, p. 597, 2022.
  • [26] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
  • [27] M. Wiedmann, M. Hölle, M. Periyasamy, N. Meyer, C. Ufrecht, D. D. Scherer, A. Plinge, and C. Mutschler, “An empirical comparison of optimizers for quantum machine learning with spsa-based gradients,” arXiv preprint arXiv:2305.00224, 2023.
  • [28] D. Dong, C. Chen, and Z. Chen, “Quantum reinforcement learning,” in Advances in Natural Computation: First International Conference, ICNC 2005, Changsha, China, August 27-29, 2005, Proceedings, Part II 1.   Springer, 2005, pp. 686–689.
  • [29] D. Dong, C. Chen, H. Li, and T.-J. Tarn, “Quantum reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 5, pp. 1207–1220, 2008.
  • [30] L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 1996, pp. 212–219.
  • [31] S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, “Variational quantum circuits for deep reinforcement learning,” IEEE Access, vol. 8, pp. 141 007–141 024, 2020.
  • [32] Z. Cheng, K. Zhang, L. Shen, and D. Tao, “Offline quantum reinforcement learning in a conservative manner,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 6, 2023, pp. 7148–7156.
  • [33] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning.   PMLR, 2019, pp. 2052–2062.
  • [34] S. Thrun and A. Schwartz, “Issues in using function approximation for reinforcement learning,” in Proceedings of the Fourth Connectionist Models Summer School, vol. 255.   Hillsdale, NJ, 1993, p. 263.
  • [35] S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv:2005.01643, 2020.
  • [36] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, 2016.
  • [37] M. Franz, L. Wolf, M. Periyasamy, C. Ufrecht, D. D. Scherer, A. Plinge, C. Mutschler, and W. Mauerer, “Uncovering instabilities in variational-quantum deep q-networks,” Journal of The Franklin Institute, 2022.
  • [38] N. Meyer, D. Scherer, A. Plinge, C. Mutschler, and M. Hartmann, “Quantum policy gradient algorithm with optimized action decoding,” in International Conference on Machine Learning.   PMLR, 2023, pp. 24 592–24 613.
  • [39] A. Pérez-Salinas, A. Cervera Lierta, E. Gil-Fuster, and J. Latorre, “Data re-uploading for a universal quantum classifier,” Quantum, vol. 4, p. 226, 02 2020.
  • [40] M. Periyasamy, N. Meyer, C. Ufrecht, D. D. Scherer, A. Plinge, and C. Mutschler, “Incremental data-uploading for full-quantum classification,” in 2022 IEEE International Conference on Quantum Computing and Engineering (QCE).   IEEE, 2022, pp. 31–37.
  • [41] O. Berezniuk, A. Figalli, R. Ghigliazza, and K. Musaelian, “A scale-dependent notion of effective dimension,” arXiv preprint arXiv:2001.10872, 2020.
  • [42] A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner, “The power of quantum neural networks,” Nature Computational Science, vol. 1, no. 6, pp. 403–409, 2021.
  • [43] J. Rissanen, “Fisher information and stochastic complexity,” IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
  • [44] Qiskit contributors, “Qiskit: An open-source framework for quantum computing,” 2023.
  • [45] K. Temme, S. Bravyi, and J. M. Gambetta, “Error mitigation for short-depth quantum circuits,” Physical Review Letters, vol. 119, no. 18, Nov. 2017. [Online]. Available: http://dx.doi.org/10.1103/PhysRevLett.119.180509