
BCQQ: Batch-Constraint Quantum Q-Learning with Cyclic Data Re-uploading

Maniraman Periyasamy, Marc Hölle, Marco Wiedmann, Daniel D. Scherer, Axel Plinge, Christopher Mutschler Fraunhofer IIS, Fraunhofer Institute for Integrated Circuits IIS, Nuremberg, Germany

Abstract

Deep reinforcement learning (DRL) often requires a large amount of data and many environment interactions, making the training process time-consuming. This challenge is further exacerbated in the case of batch RL, where the agent is trained solely on a pre-collected dataset without environment interactions. Recent advancements in quantum computing suggest that quantum models might require less data for training compared to classical methods. In this paper, we investigate this potential advantage by proposing a batch RL algorithm that utilizes variational quantum circuits (VQCs) as function approximators within the discrete batch-constraint deep Q-learning (BCQ) algorithm. Additionally, we introduce a novel data re-uploading scheme by cyclically shifting the order of input variables in the data encoding layers. We evaluate the efficiency of our algorithm on the OpenAI CartPole environment and compare its performance to the classical neural network-based discrete BCQ.

Index Terms:

quantum reinforcement learning, batch reinforcement learning, variational quantum computing, data uploading, data re-uploading, batch quantum reinforcement learning, offline quantum reinforcement learning.

I Introduction

The challenge of applying reinforcement learning (RL) in real-world problems lies in its training process. In contrast to fully data-driven machine learning (ML) procedures such as supervised learning, RL learns via environment interactions. An RL agent follows a policy and chooses actions that change the state of the environment, for which it receives a reward (possibly after each interaction). The agent’s objective is to learn a policy that maximizes the long-term reward. Using this framework, agents based on deep neural networks (DNNs) have been remarkably successful in a variety of complex tasks, including super-human performance in computationally-hard board games [1] and discovering fast matrix multiplication algorithms [2].

Unfortunately, this interactive approach is not feasible in many safety-critical scenarios that could benefit from RL, e.g., robotics or healthcare. While it is in general possible to train RL agents in a simulator, it is often non-trivial to deploy them in the real world due to the domain gap between simulation and reality. In these cases, it would be beneficial to utilize real-world data, gathered by an expert operator, and train in a purely data-driven, offline fashion. However, current offline algorithms require large datasets to match the performance of algorithms trained with environment interactions [3]. This is an impediment in domains where only limited data is available. One of the main reasons for this performance loss is that the distribution of states and actions during testing can drastically differ from the training data. Although recent RL research has addressed this problem extensively, these challenges remain mostly unresolved.

With the advent of the first practical quantum computers, so-called noisy intermediate scale quantum (NISQ) devices, it is natural to investigate whether this new computing paradigm could be leveraged to improve RL. Factors like low gate fidelity and coherence times of these NISQ devices cause applied research to focus on hybrid quantum-classical schemes such as variational quantum circuits (VQCs) [4, 5, 6]. These can be considered as the quantum analog to classical DNNs and are therefore also called quantum neural networks. Theoretical work indicates that VQCs are more data-efficient than classical ML methods [7]. As data can become a bottleneck when learning offline, we are interested in investigating if this theoretical advantage can be turned into a practical performance gain. This would translate to a quantum reinforcement learning (QRL) algorithm [8] that can learn a policy from a small dataset and outperform a classical policy trained on the same data. Currently, the limited number of qubits together with the corresponding hardware topology and the complexity of numerical simulations make it intractable to benchmark QRL algorithms on state-of-the-art environments on a large scale. However, proof-of-concept experiments can be executed in low-complexity environments such as OpenAI’s CartPole [9].

Our contribution in this paper is two-fold. First, we show how to apply function approximation with VQCs within the discrete batch-constraint deep Q-learning algorithm [10], where we find a performance advantage over the classical counterpart in CartPole. Near-optimal performance can be achieved in a low-data regime, where a classical agent with a similar number of parameters, which is able to solve CartPole in an online setting, fails offline. Second, we present a cyclic data re-uploading scheme that proved to be advantageous in this batch RL context and analyze the effective dimension of the resulting VQC.

II Theoretical Background

II-A General Framework of Reinforcement Learning

On an abstract level, RL can be modelled in the framework of Markov decision processes (MDPs) [11], in which an agent interacts with its environment. An MDP is defined by a set of states $\mathcal{S}$, a set of actions $\mathcal{A}$, a transition function $T$ that determines the distribution of the next state, a reward function $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathds{R}$, and a discount factor $\gamma\in[0,1]$. The goal is to find an optimal policy that assigns each action $a\in\mathcal{A}$ a probability $\pi(a|s)$ with which the agent should take that particular action, given that the environment is in state $s\in\mathcal{S}$. In this case, optimality means that the policy should maximize the expected, discounted reward for any given initial state $s_{0}\in\mathcal{S}$, i.e.

\pi^{*}\in\underset{\pi}{\mathrm{argmax}}\sum_{t=0}^{\infty}\underset{\begin{subarray}{c}a_{t}\sim\pi\\ s_{t+1}\sim T\end{subarray}}{\mathds{E}}\gamma^{t}R(s_{t},a_{t}).

(1)

One usually defines a state-action value function or Q-function

Q_{\pi}(s,a)=\underset{s_{0}\sim T}{\mathds{E}}\left(R(s,a)+\gamma\sum_{t=0}^{\infty}\underset{\begin{subarray}{c}a_{t}\sim\pi\\ s_{t+1}\sim T\end{subarray}}{\mathds{E}}\gamma^{t}R(s_{t},a_{t})\right)

(2)

as the expected, discounted cumulative reward from choosing action $a$ in state $s$ and following the policy $\pi$ from there on. It can be shown that the Q-function of an optimal policy must satisfy a recursive relation known as the Bellman optimality equation [12]

Q^{*}(s,a)=\underset{s^{\prime}\sim T}{\mathds{E}}\left(R(s,a)+\gamma\underset{a^{\prime}\in\mathcal{A}}{\text{max}}Q^{*}(s^{\prime},a^{\prime})\right).

(3)

On the other hand, given the optimal Q-function $Q^{*}$ one can represent the corresponding optimal policy as

\pi^{*}(a|s)=\frac{1}{\mathcal{N}}\begin{cases}1\text{ , if }a\in\underset{a^{\prime}\in\mathcal{A}}{\text{argmax }}Q^{*}(s,a^{\prime})\\ 0\text{ , else}\end{cases}

(4)

where $\mathcal{N}$ is a normalization factor assuring that all probabilities sum to unity.
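To make the role of the normalization factor in equation (4) concrete, the following minimal Python sketch builds the greedy policy for a single state from a list of Q-values. The function name `greedy_policy` and the example values are purely illustrative and not part of the original work.

```python
def greedy_policy(q_values):
    """Return pi*(a|s) for one state: uniform over all maximizing actions."""
    best = max(q_values)
    is_best = [1.0 if q == best else 0.0 for q in q_values]
    norm = sum(is_best)  # normalization factor N = number of tied maximizers
    return [p / norm for p in is_best]


# Two actions tie for the maximum, so each receives probability 1/2.
print(greedy_policy([0.2, 1.5, 1.5]))  # [0.0, 0.5, 0.5]
```

If the maximizer is unique, $\mathcal{N}=1$ and the policy is deterministic.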

II-B Online vs. Batch and On-Policy vs. Off-Policy RL

The task of learning an optimal policy can be approached in different ways. One fundamental distinction between RL algorithms is the kind of data used in the training process. If the learning algorithm interacts with the environment during the training process, it is called an online algorithm. However, this does not mean that the algorithm needs to be on-policy, i.e. exclusively use its current estimate of an optimal policy to gather data. Off-policy algorithms, on the other hand, often employ explorative policies to gather experience from the environment and train a separate policy that should eventually solve the given task. In contrast to this, batch or offline algorithms only need access to data that was collected from the environment beforehand. In principle, any off-policy learning algorithm can be used for batch RL, but it requires special care to avoid problems during training [10]. Many offline RL approaches rely on Q-learning with deep Q-networks (DQN), as explained in the next section.

II-C Deep Q-Learning

Q-learning relies on the fact that an optimal policy is induced by a Q-function that solves the Bellman equation (3). A popular approach for large-scale problems is DQN, which uses DNNs as parametrized function approximators for $Q^{*}$ [13]. These networks can be trained with a loss function derived from the Bellman equation (3):

l(\theta)=\underset{M}{\mathds{E}}\left(r+\gamma\underset{a^{\prime}\in\mathcal{A}}{\text{max}}Q_{\theta^{\prime}}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2}

(5)

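where $M$ is a mini-batch of transitions $(s,a,r,s^{\prime})$ sampled from the buffer $\mathcal{B}$, and $\theta^{\prime}$ denotes the parameters of a target network, i.e. a periodically updated copy of $\theta$.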
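The loss in equation (5) can be illustrated with a short Python sketch. The callables `q` and `q_target` are hypothetical stand-ins for $Q_{\theta}$ and $Q_{\theta^{\prime}}$ that map a state to a list of Q-values (one per action). As in equation (5), terminal transitions are not treated separately here.

```python
def dqn_loss(batch, q, q_target, gamma):
    """Mean squared Bellman error over a mini-batch of (s, a, r, s_next)."""
    total = 0.0
    for s, a, r, s_next in batch:
        # Bootstrapped target uses the target network Q_theta'.
        target = r + gamma * max(q_target(s_next))
        total += (target - q(s)[a]) ** 2
    return total / len(batch)


# Toy example with scalar states and two actions (illustrative values only).
q = lambda s: [s, 2.0 * s]
q_target = lambda s: [0.5 * s, s]
loss = dqn_loss([(1.0, 1, 1.0, 2.0)], q, q_target, gamma=0.9)
```

In practice, the gradient of this loss is taken with respect to $\theta$ only, while $\theta^{\prime}$ is held fixed.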

II-D Variational Quantum Circuits for RL

VQCs have gained a lot of attention from the quantum computing community in recent years due to their NISQ feasibility. It has already been established that they are a potentially powerful platform for quantum-enhanced ML [14, 15, 16]. While any quantum computation can be decomposed into a sequence of quantum gates, the power of VQCs arises from the fact that some of these gates are parameterized by a continuous variable. For example, any single-qubit gate can be expressed as a rotation in 3D space acting on the Bloch vector and is therefore parameterized by the three Euler angles corresponding to this rotation. This allows us to use quantum circuits as parameterized function approximators. For a detailed introduction to the framework and theory of quantum computing, we refer the reader to [17].

A VQC that represents a function $f_{\theta}(x)$ is composed of three basic components [18]:

1. The data encoding. The unitary $U(x)$ encodes the input data $x$ into a quantum state $\ket{\psi(x)}=U(x)\ket{0}$.

2. The variational layers. A different unitary $U(\theta)$ maps the input state $\ket{\psi(x)}$ onto the output state $\ket{\phi(x,\theta)}=U(\theta)\ket{\psi(x)}$. A common approach is to decompose $U(\theta)$ into a set of repeated layers, which contain both parameterized rotations and entangling gates.

3. The measurement. An observable $O$ is chosen and its expectation value is estimated over several runs of the circuit. The result represents the output of the VQC

f_{\theta}(x)=\bra{\phi(x,\theta)}O\ket{\phi(x,\theta)}.

(6)

It has been shown in [19, 20] that a given VQC actually realizes a specific Fourier sum

f_{\theta}(x)=\sum_{\omega\in\Omega}c_{\omega}(\theta)e^{i\omega x},

(7)

where the available frequency spectrum $\Omega$ is determined by the data encoding. Repeating the data encoding, i.e. the action of the unitary $U(x)$ , multiple times throughout the circuit (cf. section Data Re-Uploading) enlarges the available spectrum. Thus, a broader class of functions can be accessed through this so-called data re-uploading.
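As a toy illustration of equations (6) and (7), the following Python sketch simulates a single-qubit circuit with one Rx encoding gate, one Ry variational gate and a Pauli-$Z$ measurement. This circuit is an assumption chosen for brevity. Its output is $\cos(\theta)\cos(x)$, i.e. a Fourier sum with frequencies $\omega=\pm 1$ whose coefficients depend on $\theta$.

```python
import math


def rx(angle):
    """Matrix of a single-qubit rotation about the x-axis."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def ry(angle):
    """Matrix of a single-qubit rotation about the y-axis."""
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


def apply(gate, state):
    """Multiply a 2x2 gate with a two-component state vector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


def vqc_output(x, theta):
    state = [1.0, 0.0]                    # |0>
    state = apply(rx(x), state)           # data encoding U(x)
    state = apply(ry(theta), state)       # variational layer U(theta)
    return abs(state[0]) ** 2 - abs(state[1]) ** 2  # expectation value of Z
```

Applying the encoding gate more than once would add higher frequencies to the sum, which is the effect exploited by data re-uploading.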

To take this analogy even further: Just like NNs, VQCs can be trained from a set of samples $\{(x,f(x))\}$ of the desired function. The optimization of the parameters $\theta$ in the training phase of the VQC is carried out on a classical computer, which may evaluate the VQC as a quantum subroutine. The hope is that VQCs are able to access a broader class of functions with fewer parameters and enable more data efficient learning. Since the full action of the unitaries $U(x)$ and $U(\theta)$ is believed to be hard to simulate classically for appropriately designed VQCs, quantum computers might provide an advantage in machine learning tasks.

In the context of QRL, a common approach is to employ VQCs instead of NNs to represent the Q-function in DQN. The state of the environment is used as the input to the VQC. Each available action $a\in\mathcal{A}$ is assigned an observable $O_{a}$ , such that the Q-function estimate is given by

Q_{\theta}(s,a)=\bra{\phi(s,\theta)}O_{a}\ket{\phi(s,\theta)}.

(8)

However, since the output of the VQC is the quantum mechanical expectation value of an observable (cf. equation (6)), the range of possible output values is limited. Popular choices for the observables $O_{a}$ are combinations of single-qubit Pauli matrices, which restrict the output of the VQC to the interval $[-1,1]$. The true Q-function, however, might lie outside this range. To address this, one can scale each expectation value by a classical weight $w_{a}$, which is learned during training together with the circuit parameters [21].

II-E Efficient Gradient Estimation on Quantum Devices

Since computing an explicit representation of the unitary $U(\theta)$ of the VQC is classically hard, one cannot efficiently compute the gradient of the VQC output with respect to the parameters $\theta$ with classical techniques. Although the gradient can be obtained directly from the quantum device with the parameter-shift rule [22], this approach requires $2p$ expectation value estimations for a circuit with $p$ parameters, which quickly becomes intractable even for medium-sized circuits due to the large number of required circuit runs. This is why gradient-free optimization schemes such as simultaneous perturbation stochastic approximation (SPSA), which uses only two calls to the quantum subroutine for each update step, have become of interest to the QML community. SPSA is especially well suited for the noisy objective functions that arise when training VQCs [23, 24, 25]. It computes an approximate gradient that can be fed to state-of-the-art gradient-based optimizers like AMSGrad [26] to efficiently train medium-sized VQCs [27]. However, we suspect that this method introduces instabilities in the training of offline RL agents when no appropriate early-stopping criterion is available.
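The difference in cost between the two estimators can be sketched in a few lines of Python. The objective `f` below is a hypothetical stand-in for a VQC expectation value (a product of cosines, as produced by Pauli rotation gates), and the shift of $\pi/2$ is the basic form of the parameter-shift rule for such gates. The class `Counted` only counts how often the objective is evaluated.

```python
import math
import random


class Counted:
    """Wrap an objective and count how often it is evaluated."""

    def __init__(self, fn):
        self.fn, self.calls = fn, 0

    def __call__(self, theta):
        self.calls += 1
        return self.fn(theta)


def parameter_shift_gradient(f, theta):
    """Exact gradient for Pauli-rotation parameters: 2 evaluations each."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += math.pi / 2
        minus[i] -= math.pi / 2
        grad.append(0.5 * (f(plus) - f(minus)))
    return grad


def spsa_gradient(f, theta, c, rng):
    """Stochastic estimate from 2 evaluations, regardless of len(theta)."""
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = (f(plus) - f(minus)) / (2.0 * c)
    return [diff / d for d in delta]


f = Counted(lambda th: math.cos(th[0]) * math.cos(th[1]))
exact = parameter_shift_gradient(f, [0.3, 1.1])   # uses 2p = 4 evaluations
g = Counted(lambda th: math.cos(th[0]) * math.cos(th[1]))
approx = spsa_gradient(g, [0.3, 1.1], c=0.1, rng=random.Random(0))  # uses 2
```

The SPSA estimate is noisy for a single perturbation direction; it is only correct on average over many random directions, which is one plausible source of the instabilities mentioned above.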

III Related Work

The following section starts with a short summary of QRL, followed by a survey of relevant batch RL literature.

III-A Quantum Reinforcement Learning

The use of quantum computing for RL is an emerging field. Dong et al. [28, 29] were among its pioneers, proposing an algorithm that learns a state-value function by using the Grover search algorithm [30]. More recently, Chen et al. [31] proposed one of the first VQC-based QRL algorithms by replacing neural networks in the DQN algorithm with VQCs. For a detailed picture of QRL we refer to Meyer et al. [8]. Upon completion of our work, we came across the work of Cheng et al. [32], which uses VQCs in conservative Q-learning for offline RL. There, the authors showed that a quantum agent can learn to solve an environment in an offline fashion. However, they did not investigate the performance of quantum agents on noisy data or partial trajectories.

III-B Batch Reinforcement Learning

When RL is utilized in scenarios where humans or equipment can be harmed, the agent is usually trained in a simulated environment beforehand. Consequently, the agent’s performance might suffer from limited modelling of real-world processes, and creating these simulations causes an additional overhead. To resolve this, the objective of batch RL is to learn an optimal policy from a set of real-world data. This training data is gathered by a behavior policy (e.g. an expert human operator) interacting with the environment. As an extreme case, one can even consider a behavior policy selecting actions at random. Off-policy algorithms such as DQN are similar in spirit to offline learning, since they are in principle agnostic to how the experience was gathered. However, Fujimoto et al. [33] found that Q-value estimates of DQN diverge in an offline setting, leading to the offline trained algorithms performing worse than the same algorithms trained on the same dataset in an online manner. Agarwal et al. [3] demonstrated that offline trained agents can reach comparable performance on the OpenAI Atari 2600 games [9], but a very large (50 million environment interactions) and diverse dataset was used. This can be explained by the fact that algorithms like DQN employ a replay buffer of past environment interactions, which causes the gathered data to be correlated to the current policy [13]. In offline RL, this correlation is not present and there is a distributional shift between training and testing. Furthermore, out-of-distribution states may arise if the agent encounters a state during testing that was not part of its training data.

Additionally, Q-learning-based algorithms like DQN tend to suffer from overestimating Q-values, due to the objective of maximizing the expected return [34]. In the online scenario, this is not as severe, due to corrective feedback from the environment during training. However, since this feedback is missing in an offline setting, this overestimation bias together with the distributional shift causes the policy to extrapolate poorly to unfamiliar states. Fujimoto et al. [33] refer to this problem as extrapolation error. Approaches trying to alleviate this error typically introduce some constraint on the policy to keep it close to the behavior policy. Typically, closeness is enforced either directly with respect to a probability metric or through penalty terms in the policy update [35].

III-C Batch-Constraint Deep Q-Learning

An algorithm implicitly following the first approach is batch-constraint deep Q-Learning (BCQ) [33]. The key idea is that in order to avoid the distributional shift, a trained policy should induce a similar state-action visitation to what is observed in the batch. In the following, such policies are called batch-constrained. To achieve this, BCQ uses a generative model $G_{\omega}$ to preselect likely actions according to the batch. The policy is only allowed to choose from this preselection. In the following, we will restrict the discussion of BCQ to the discrete action setting. In this case, the generative model can be understood as a map $G_{\omega}:\mathcal{S}\rightarrow\Delta\left(\mathcal{A}\right)$ that takes the current environment state as input and outputs the probability with which each action would occur in the batch. In particular, if the batch is filled using transitions from a policy $\pi_{b}$ then the generative model should reconstruct this policy, i.e. $G_{\omega}(a|s)\approx\pi_{b}(a|s)$ . From this, one can preselect the actions by discarding actions whose probability relative to the most likely one is below a threshold $\tau$

\tilde{\mathcal{A}}(s)=\left\{a\in\mathcal{A}\Bigg{|}\frac{G_{\omega}(a|s)}{\underset{\hat{a}\in\mathcal{A}}{\text{max}}G_{\omega}(\hat{a}|s)}>\tau\right\}.

(9)
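The preselection in equation (9) amounts to a simple filter on the output of the generative model. A minimal Python sketch, in which `action_probs` is a hypothetical list holding $G_{\omega}(a|s)$ for every action:

```python
def batch_constrained_actions(action_probs, tau):
    """Indices of actions whose probability relative to the most likely
    action exceeds the threshold tau."""
    top = max(action_probs)
    return [a for a, p in enumerate(action_probs) if p / top > tau]


print(batch_constrained_actions([0.8, 0.2], tau=0.3))  # [0]
print(batch_constrained_actions([0.6, 0.4], tau=0.3))  # [0, 1]
```

The most likely action always passes the filter for $\tau<1$, so the preselection is never empty.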

Both the policy

\pi_{\theta}(a|s)=\frac{1}{\mathcal{N}}\begin{cases}1\text{ , if }a\in\underset{a^{\prime}\in\tilde{\mathcal{A}}(s)}{\text{argmax}}Q_{\theta}(s,a^{\prime})\\ 0\text{ , else}\end{cases}

and the target in the loss function

l(\theta)=\underset{(s,a,r,s^{\prime})\in\mathcal{B}}{\mathds{E}}\left(r+\gamma\underset{a^{\prime}\in\tilde{\mathcal{A}}(s^{\prime})}{\text{max}}Q_{\theta^{\prime}}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2}

are updated to only consider this preselection of actions. The generative model itself is trained with a standard cross-entropy loss

l(\omega)=-\sum_{(s,a)\in\mathcal{B}}\text{log}\left(G_{\omega}(a|s)\right).

Additionally, to address the overestimation bias of Q-learning towards underrepresented transitions, a technique called Double DQN [36] is employed. Instead of selecting the maximal action with respect to the target network in the Q-learning target, the maximal action with respect to the current Q-network is chosen, but it is still evaluated using the target network. The corresponding loss function is

l(\theta)=\underset{(s,a,r,s^{\prime})\in\mathcal{B}}{\mathds{E}}\left(r+\gamma Q_{\theta^{\prime}}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\right)^{2}

(10)

where

a^{\prime}\in\underset{\tilde{a}\in\tilde{\mathcal{A}}(s^{\prime})}{\text{argmax}}Q_{\theta}(s^{\prime},\tilde{a}).
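The target inside equation (10) combines the action preselection with the Double DQN rule. The following Python sketch computes it for a single transition. All argument names are hypothetical: `q_next` and `q_target_next` hold the values of $Q_{\theta}(s^{\prime},\cdot)$ and $Q_{\theta^{\prime}}(s^{\prime},\cdot)$, and `g_next` holds $G_{\omega}(\cdot|s^{\prime})$.

```python
def bcq_double_dqn_target(r, q_next, q_target_next, g_next, gamma, tau):
    """Target r + gamma * Q_theta'(s', a') with a' chosen by Q_theta
    among the batch-constrained actions only."""
    top = max(g_next)
    allowed = [a for a, p in enumerate(g_next) if p / top > tau]
    # Select with the current network, evaluate with the target network.
    a_star = max(allowed, key=lambda a: q_next[a])
    return r + gamma * q_target_next[a_star]


# Action 1 has the larger Q-value but is unlikely under the batch,
# so the constrained target falls back to action 0.
target = bcq_double_dqn_target(
    r=1.0, q_next=[0.5, 2.0], q_target_next=[0.7, 1.5],
    g_next=[0.9, 0.1], gamma=0.99, tau=0.3)
```

With $\tau=0$ the preselection contains every action with non-zero probability, and the expression reduces to the ordinary Double DQN target.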

In this work, we apply the variational quantum deep Q-networks (VQ-DQN) proposed by [37] to get an offline QRL algorithm which we call batch-constraint quantum Q-learning (BCQQ).

IV BCQQ

Classical batch RL methods like discrete BCQ often struggle to learn an optimal policy in scenarios where high-quality training samples are not available in abundance. Loosely interpreted, this behavior suggests that function approximation via classical neural networks requires a large amount of data to effectively approximate a policy. However, VQCs have shown some indication of approximating a function from far fewer samples in the supervised learning context [7]. Therefore, we designed and conducted a series of experiments to study the capabilities of VQCs in learning a policy in the batch RL setup.

IV-A RL Environment and Offline Data Collection

The capabilities of a VQC in learning an optimal policy in an online fashion to solve CartPole environments have already been studied in various instances [21, 37]. Therefore, we chose the CartPole-v1 environment from the OpenAI gym [9] as the target environment for all the performed experiments. Offline datasets for real-world problems often do not contain trajectories from the optimal policy for a given setup. To mimic the worst-case scenario, we experimented with the most adverse situation, where the data buffer contains trajectories only from a random policy interacting with the CartPole-v1 environment. Buffers with $10^{2}$, $10^{4}$, and $10^{6}$ samples were collected using the above-mentioned setup. Additionally, we study how well the offline agents are able to learn from an expert policy. To avoid pure imitation, the trajectories gathered by the expert were artificially corrupted with noise, by letting the expert choose a random action with a low probability. Buffers with $100$, $50$, and $25$ samples were collected using the noisy-expert policy for this experiment.
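The noise injection for the expert buffers can be sketched as follows. The function name and the parameter `epsilon` are hypothetical; the text only states that the random-action probability is low and does not give its value.

```python
import random


def noisy_expert_action(expert_action, n_actions, epsilon, rng):
    """With probability epsilon, replace the expert's choice by a
    uniformly random action; otherwise keep the expert's choice."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return expert_action


rng = random.Random(0)
# CartPole-v1 has two actions; epsilon = 0.1 is an illustrative value only.
actions = [noisy_expert_action(1, n_actions=2, epsilon=0.1, rng=rng)
           for _ in range(25)]
```

Setting `epsilon` to 1 recovers the purely random behavior policy used for the worst-case buffers.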

IV-B Variational Quantum Circuit

Figure 1: The VQC that is used as the function approximator for the BCQQ algorithm. Note: Each $\vec{\theta}$ block represents the repetition of the variational layer ansatz with different trainable parameters.

The VQC used as the function approximator in BCQQ is shown in Fig. 1. A four-qubit quantum system was chosen as the target system, as the CartPole-v1 environment has a four-dimensional state space. Here, each feature of the observation is encoded into the VQC using a single-qubit Rx gate on each qubit. The variational block comprises five layers containing four parameterized Ry and four parameterized Rz gates each. In addition to the parameterized rotational gates, each layer also includes two-qubit CZ entanglement gates with nearest-neighbor connectivity in the circuit layout. We chose the nearest-neighbor connectivity because it is one of the most commonly available quantum hardware topologies. The CartPole-v1 environment has an action space of size two. Therefore, the expectation values of the Pauli-$ZZ$ observable on qubits 1 and 2 and of the Pauli-$ZZ$ observable on qubits 3 and 4 were used to decode the Q-values from the VQC. It is to be noted that the encoding scheme, the VQC ansatz, and the decoding scheme are simple design choices based on previous works [37, 38]. Different combinations of parameter-shift and SPSA-based gradient estimators along with the Adam and AMSGrad optimizers were tested for optimizing the trainable parameters.

IV-B1 Data Re-Uploading

Data re-uploading [39] is an encoding strategy where the encoding scheme is repeated throughout the VQC. The ordering of the encoding scheme and the trainable layers in a standard VQC is shown in Fig. 1. Schuld et al. [22] show that the more encoding layers are present in the VQC, the larger the frequency spectrum captured by the VQC. Hence, the encoding scheme was re-introduced before every variational layer. Fig. 2 represents the circuit generated using the data re-uploading method.

Figure 2: Quantum agent with standard data re-uploading strategy

IV-B2 Cyclic Data Re-Uploading

It has been established that spreading encoding gates for the feature vector of a given data point throughout the quantum circuit results in a better representation of the data [40]. Inspired by this, we decided to expose each qubit to all the features of the current input state. We achieve this by slightly modifying the data re-uploading strategy explained in the section on Data Re-Uploading. In contrast to the standard approach, each time the encoding scheme is re-introduced, the input feature vector is shifted by one step in a round-robin fashion. Therefore, we call this type of data re-uploading “cyclic data re-uploading”. To the best of our knowledge, this type of encoding scheme has not been explored in the literature before. Fig. 3 represents the circuit generated using the cyclic data re-uploading method.
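The following pure-Python state-vector sketch illustrates how cyclic data re-uploading changes the encoding from layer to layer. It follows the description above (Rx encoding before each of the five variational layers, Ry and Rz rotations, CZ gates between neighbors, $ZZ$ readout with classical output weights), but several details are assumptions made for illustration: the direction of the shift, the gate order within a layer, a linear (non-ring) CZ chain, and zero-based qubit indices. It is not the Qiskit implementation used in the experiments.

```python
import cmath
import math

N_QUBITS, N_LAYERS = 4, 5


def apply_1q(state, gate, q):
    """Apply a 2x2 gate to qubit q (bit q of the basis-state index)."""
    out, bit = list(state), 1 << q
    for i in range(len(state)):
        if not i & bit:
            a, b = state[i], state[i | bit]
            out[i] = gate[0][0] * a + gate[0][1] * b
            out[i | bit] = gate[1][0] * a + gate[1][1] * b
    return out


def rx(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -1j * s], [-1j * s, c]]


def ry(t):
    c, s = math.cos(t / 2), math.sin(t / 2)
    return [[c, -s], [s, c]]


def rz(t):
    return [[cmath.exp(-0.5j * t), 0], [0, cmath.exp(0.5j * t)]]


def apply_cz(state, q1, q2):
    """Flip the sign of amplitudes where both qubits are in state 1."""
    return [-amp if (i >> q1) & 1 and (i >> q2) & 1 else amp
            for i, amp in enumerate(state)]


def expect_zz(state, q1, q2):
    """<Z_q1 Z_q2>: +1 if the two bits agree, -1 otherwise."""
    return sum((1 - 2 * (((i >> q1) ^ (i >> q2)) & 1)) * abs(amp) ** 2
               for i, amp in enumerate(state))


def cyclic_features(features, layer):
    """Feature order seen by qubits 0..3 in the given layer."""
    n = len(features)
    return [features[(q + layer) % n] for q in range(n)]


def q_values(features, theta, weights, cyclic=True):
    """theta[layer] holds 4 Ry angles followed by 4 Rz angles."""
    state = [0j] * (2 ** N_QUBITS)
    state[0] = 1 + 0j
    for layer in range(N_LAYERS):
        enc = cyclic_features(features, layer) if cyclic else list(features)
        for q in range(N_QUBITS):
            state = apply_1q(state, rx(enc[q]), q)
        for q in range(N_QUBITS):
            state = apply_1q(state, ry(theta[layer][q]), q)
            state = apply_1q(state, rz(theta[layer][N_QUBITS + q]), q)
        for q in range(N_QUBITS - 1):
            state = apply_cz(state, q, q + 1)
    return [weights[0] * expect_zz(state, 0, 1),
            weights[1] * expect_zz(state, 2, 3)]
```

With four features and five layers, every qubit sees each feature at least once, and the fifth layer repeats the ordering of the first. Passing `cyclic=False` gives the standard re-uploading scheme for comparison.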

IV-C Discrete Batch-Constraint Quantum Q-Learning

As explained in the sections Introduction and Theoretical Background, this work aims to study the advantages gained by using VQCs as function approximators in the discrete BCQ algorithm to learn an optimal policy for solving a given environment. To this end, we replaced the generative model $G_{\omega}$ and the model approximating the optimal policy with two trainable VQCs, as explained in the section Variational Quantum Circuits for RL. Since we encode data via single-qubit rotational gates, each entry of the observation vectors in the dataset is normalized using the encoding scheme presented by [37]. The overall discrete batch-constraint quantum Q-learning algorithm is summarized in Algorithm 1. All the experiments in this study were conducted using the following common hyper-parameters: discount factor $\gamma=0.99$, threshold $\tau=0.3$, and mini-batch size of 32. Three different learning rates $\alpha\in\{0.01,0.001,0.0003\}$ were tested for each experiment, as the classical and quantum models might need different learning rates for optimal learning on similar problem setups.

Algorithm 1 Discrete BCQQ training algorithm

Normalize dataset $\mathcal{B}$ between $[-\pi,\pi]$
Initialize encoding unitary $U(\cdot)$
Initialize Q value approximator using VQC with $\theta$ and $\theta^{\prime}$
Initialize generative model $G_{\omega}$ using VQC with $\omega$
while training not converged do
    Sample mini-batch $M$ from $\mathcal{B}$
    for all $(s,a,r,s^{\prime})\in M$ do
        Get batch-constraint actions $\tilde{\mathcal{A}}(s^{\prime})$ from (9)
        Collect $a^{\prime}\in\underset{\tilde{a}\in\tilde{\mathcal{A}}(s^{\prime})}{\text{argmax}}Q_{\theta}(|\psi(s^{\prime})\rangle,\tilde{a})$
    end for
    Optimize $\theta$ w.r.t. $l(\theta)=\underset{M}{\mathds{E}}\left(r+\gamma Q_{\theta^{\prime}}(|\psi(s^{\prime})\rangle,a^{\prime})-Q_{\theta}(|\psi(s)\rangle,a)\right)^{2}$
    Optimize $\omega$ w.r.t. $l(\omega)=-\underset{M}{\mathds{E}}\text{log}\left(G_{\omega}(a,|\psi(s)\rangle)\right)$
    if target VQC update then
        $\theta^{\prime}\leftarrow\theta$
    end if
end while
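The first step of Algorithm 1 rescales every observation feature so that it can be used as a rotation angle. The sketch below uses a per-feature min-max scaling over the buffer as a stand-in. This is an assumption for illustration and not necessarily the normalization of [37] that is used in the experiments.

```python
import math


def normalize_to_pi(column):
    """Min-max scale one feature column of the buffer to [-pi, pi]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to angle 0.
        return [0.0 for _ in column]
    return [-math.pi + 2 * math.pi * (v - lo) / (hi - lo) for v in column]


# Hypothetical cart-velocity values taken from three transitions.
angles = normalize_to_pi([-1.2, 0.0, 2.4])
```

Because an Rx rotation is $2\pi$-periodic in its angle, restricting the inputs to one period avoids mapping distinct feature values onto the same encoding gate.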

IV-D Model selection

In order to understand the difference between cyclic and standard data re-uploading, we analyze the effective dimension of the resulting models. The effective dimension captures the expressivity of an ML model and is based on the empirical Fisher information matrix (FIM) [41, 42, 43]. Intuitively, it quantifies the range of different functions a given model can approximate. Furthermore, the eigenvalue spectrum of the FIM gives insights into the geometry of the model’s parameter space and hence the model’s trainability. Fig. 4 shows both the effective dimension and the eigenvalue spectrum of the FIM for both quantum and classical models using CartPole-v1 states as input. It becomes apparent that the cyclic data re-uploading strategy results in a slight increase in the effective dimension, even though the eigenvalues of the FIM are more or less uniformly distributed in all quantum models. Therefore, the cyclic data re-uploading strategy was chosen as the quantum model for the experiments. The effective dimension of the classical model was calculated only for a network with 50 parameters, because computing the FIM for larger networks is limited by the available memory. Classical networks with a higher number of parameters are nevertheless considered in the experiments. Details regarding the network architectures are given in the later sections.

V Results and Discussion

V-A Training on Random Trajectory

The VQC presented in Fig. 3 was used as the quantum agent for the discrete BCQQ algorithm to find an optimal policy, which solves the CartPole-v1 environment from a buffer filled with random environment interactions. Table I presents the cumulative validation reward averaged over three training runs each, which were performed for a maximum of 25000 gradient update steps. In addition, training was stopped when the agent reached a cumulative reward of 500 in all ten validation environments to reduce the computational cost of the experiments. All experiments were performed on a quantum simulator using the Qiskit API [44]. SPSA-based gradient estimation along with the AMSGrad optimizer and a learning rate of 0.01 resulted in the best average reward for the quantum agent. From Table I, it can be seen that the quantum agent with the cyclic data re-uploading strategy is able to learn an optimal policy for all buffer sizes, even from just 100 random interactions.

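Function Approx.	No. of Params.	Buffer Size	Average Reward
Quantum Agent with Cyclic DRU	42x2	1e6	500
Quantum Agent with Cyclic DRU	42x2	1e4	500
Quantum Agent with Cyclic DRU	42x2	1e2	500
Classical Neural Network	67586x2	1e6	500
Classical Neural Network	67586x2	1e4	348.55
Classical Neural Network	67586x2	1e2	69.33
Classical Neural Network	67x2	1e6	342.22
Classical Neural Network	67x2	1e4	336.33
Classical Neural Network	67x2	1e2	12.88

Table I: Average cumulative reward returned by quantum and classical agents. All results are averaged over three training runs. DRU: data re-uploading.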

V-A1 Classical Benchmark

To benchmark the performance of VQCs against classical algorithms, we trained two different classical DNNs using the discrete BCQ algorithm on the datasets explained in the section RL Environment and Offline Data Collection. We chose a fully connected architecture with two hidden layers of either 256 or 5 nodes each and ReLU activation. Both networks were trained for a maximum of 100000 steps, instead of 25000 steps as in discrete BCQQ training. This increase was motivated by the fact that these classical DNNs can be updated with lower computational cost per update step. However, the early stopping criterion is still used so that the experimental setup remains comparable to the previous one. In the case of the classical network, the Adam optimizer with a learning rate of 0.01 resulted in the highest average reward. The results presented in Table I show that the classical agent trained with discrete BCQ has difficulty learning an optimal policy from small buffers generated using a random policy. Only the large NN was able to learn an optimal policy in the case of one million samples. However, even the larger network was unable to converge to an optimal solution when the size of the training buffer was reduced.
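The parameter counts in Table I can be reproduced from the stated architectures with a short calculation. The sketch below assumes four inputs, two outputs and bias terms in every layer for the classical networks. For the VQC it assumes that the count consists of the 40 rotation angles plus one classical output weight per action. Reading the factor "x2" as the Q-network plus the generative model is likewise an assumption.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(mlp_param_count([4, 256, 256, 2]))  # 67586
print(mlp_param_count([4, 5, 5, 2]))      # 67

# VQC: 5 layers x (4 Ry + 4 Rz) angles, plus 2 output weights (assumed).
vqc_params = 5 * (4 + 4) + 2
print(vqc_params)                         # 42
```

Under these assumptions, the small classical network has roughly 1.6 times as many parameters as the quantum agent, and the large one roughly 1600 times as many.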

V-A2 Globality Testing

From Table I it becomes clear that the VQC with cyclic data re-uploading trained with the discrete BCQQ algorithm can learn a policy that attains a cumulative reward of at least 500 in the CartPole-v1 environment in all cases. However, it is also interesting to see the maximum cumulative reward beyond 500 that a trained agent can achieve in the CartPole-v1 environment. To this end, we tested the trained agents from the random buffer experiments in the CartPole-v1 environment until either the environment terminated or a cumulative reward of 100000 was achieved. The results of this globality test are shown in Fig. 6. They show that the VQC with cyclic data re-uploading not only learns to solve the CartPole-v1 environment, but also learns a more stable policy than all other agents.

V-B Training on Partial Noisy Trajectory

The results from the above sections are useful for estimating the capabilities of a VQC with cyclic data re-uploading as a quantum agent. However, in real-world scenarios, offline RL algorithms are most useful in problems where an RL environment is not available or agent-environment interactions during training are not possible. Here, the agent should be able to learn and retain an optimal policy solely from clean (or noisy) data collected using an expert policy. For this purpose, we collected buffers of 100, 50, and 25 environment transitions, based on a pre-trained expert policy. We compare the performance of a VQC with cyclic data re-uploading against four classical DNNs, each consisting of two hidden layers with 4, 18, 50, or 100 nodes per layer. Furthermore, to check whether the agent is able to maintain the learned policy, the early stopping condition was not used here. The results from these experiments are presented in Fig. 5. The quantum agents trained on buffers of size 25 and 100 exhibited stable learning behavior with a learning rate of 0.001, whereas the agent trained on a buffer of size 50 showed stable behavior with a learning rate of 0.01. The parameter-shift rule with the Adam optimizer was better suited for the quantum agents, as agents trained with SPSA-based gradients could not retain the learned optimal policy. In contrast, all the classical agents showed stable learning behavior with the same optimizer but a learning rate of 0.0003. All the graphs show that the quantum agent was able to successfully learn to solve the CartPole-v1 environment from partial noisy trajectories. The numerical results indicate that the quantum agent exhibits convergence and performance similar to classical agents with 50 to 250 times more parameters in solving the CartPole-v1 environment. The classical agent with a comparable number of parameters struggled to learn an optimal policy when the buffer size was reduced.
Our current results indicate a slight advantage of BCQQ in terms of the amount of data and the number of trainable parameters required. Nonetheless, additional experiments with more complex environments, advanced classical networks, and larger VQCs (in terms of the number of qubits and the number of parameters) are needed to further substantiate the potential of BCQQ over the BCQ method. This is beyond the scope of this work and left for the future.

V-C Validation on Real Quantum Hardware

All the experimental results presented in the above sections illustrate that the VQC can learn an optimal policy for the CartPole-v1 environment with just 100 samples. However, all the experiments presented so far were performed using an ideal quantum simulator that computes exact expectation values. In contrast, the NISQ devices that are currently available are prone to hardware noise and can only estimate expectation values. On such devices, a given quantum circuit is executed repeatedly to estimate the expectation values. The number of repetitions of the circuit is commonly known as the number of shots. The higher the number of shots, the more accurate the estimated expectation values. To analyze the minimum number of shots required by the agent and the performance of the quantum agent under the influence of hardware noise, we tested the best-performing quantum agent from Section V-B on a noisy simulator and on real quantum hardware. The performance of the agent using different numbers of shots is shown in Fig. 7.
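The relation between the number of shots and the accuracy of an estimated expectation value can be illustrated with a small classical simulation. The sketch below assumes a single-qubit Pauli-Z measurement whose outcome is +1 with probability `p_plus` and -1 otherwise; the value 0.8 is an arbitrary illustrative choice, the function names are hypothetical, and hardware noise is not modeled. For such a measurement the standard error of the estimate is sqrt((1 - <Z>^2) / shots), so it shrinks only with the square root of the number of shots.

```python
import math
import random


def estimate_expectation(p_plus, shots, rng):
    """Estimate <Z> from `shots` single-shot outcomes of +1 (prob. p_plus) or -1."""
    outcomes = (1 if rng.random() < p_plus else -1 for _ in range(shots))
    return sum(outcomes) / shots


def standard_error(p_plus, shots):
    """Standard deviation of the shot-based estimator of <Z>."""
    exact = 2 * p_plus - 1  # exact expectation value <Z>
    return math.sqrt((1 - exact ** 2) / shots)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for shots in (32, 64, 128, 1024):
    est = estimate_expectation(0.8, shots, rng)
    err = standard_error(0.8, shots)
    print(f"shots={shots:5d}  estimate={est:+.3f}  std. error={err:.3f}")
```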

Validation results with the noisy simulator show that the agent struggles to solve the CartPole-v1 environment using expectation values estimated with 32 or 64 shots. The agent achieved its peak performance with 128 shots, and further increasing the number of shots did not result in better performance. Here, the impact of noise overshadowed the performance gained from the higher number of shots. Further, we also validated the performance of the quantum agent using the “ibmq_mumbai”, “ibmq_kolkata”, and “ibm_algiers” devices. However, due to the high computational cost of validation and the scarce availability of quantum resources, we limited the validation to 64, 256, and 1024 shots. The validation results on the IBMQ devices show that the agent was only able to achieve a reward of at most 90 due to the significant impact of noise. Nonetheless, the agent was able to achieve a high reward of 495 with the aid of the zero-noise extrapolation error mitigation technique [45] and 1024 shots. We speculate that the drop in performance on the IBMQ devices could be overcome with further training on the real device. However, further training on the NISQ device is beyond the scope of this work and left for the future.
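The idea behind zero-noise extrapolation can be sketched in a few lines: an expectation value is measured at several artificially amplified noise levels, and a polynomial through these points is evaluated at zero noise. The sketch below uses Richardson extrapolation in the form of Lagrange interpolation. The scale factors and expectation values are made-up illustrative numbers, not measurements from the experiments, and the text does not state which scale factors or extrapolation order were used; `zero_noise_extrapolate` is a hypothetical name.

```python
def zero_noise_extrapolate(scale_factors, values):
    """Evaluate at noise scale 0 the polynomial through the (scale, value) points."""
    estimate = 0.0
    for i, (c_i, v_i) in enumerate(zip(scale_factors, values)):
        # Lagrange basis polynomial of point i, evaluated at scale 0.
        weight = 1.0
        for j, c_j in enumerate(scale_factors):
            if j != i:
                weight *= c_j / (c_j - c_i)
        estimate += weight * v_i
    return estimate


# Illustrative (made-up) expectation values at noise scale factors 1, 2, 3.
scales = [1.0, 2.0, 3.0]
noisy_values = [0.70, 0.50, 0.30]
mitigated = zero_noise_extrapolate(scales, noisy_values)
print(f"mitigated estimate: {mitigated:.3f}")
```

Because the extrapolation weights can be larger than one in magnitude, statistical fluctuations of the individual measurements are amplified, which is one reason a larger number of shots tends to be helpful in combination with this technique.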

VI Conclusion

In this paper, we propose discrete BCQQ, a batch RL algorithm for discrete action spaces based on VQCs. The key component is a cyclic data re-uploading scheme, in which the encoding layers are not only repeated sequentially throughout the circuit, but the assignment of input features to qubits is also cyclically shifted from layer to layer. This leads to a slight increase in the model’s effective dimension, improving its expressivity without increasing the number of model parameters. Experiments in the CartPole-v1 environment demonstrate that discrete BCQQ can learn offline from partial trajectories. The results also indicate that the BCQQ algorithm may show better generalization and learning capabilities than classical networks when learning from smaller datasets. For future work, we want to investigate how discrete BCQQ scales with more complex VQCs and evaluate its performance in more challenging environments. In addition, we expect that cyclic data re-uploading may lead to improvements in other quantum machine learning tasks, such as classification, but leave this for future work.
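The cyclic shift of the encoding order can be made concrete with a small sketch that lists, for every encoding layer, which input feature is assigned to which qubit. It assumes one feature per qubit and a shift by one position per layer; the shift direction and the function name `cyclic_encoding_order` are assumptions made for illustration only.

```python
def cyclic_encoding_order(n_features, n_layers):
    """For each encoding layer, list the feature index encoded on each qubit."""
    return [[(qubit + layer) % n_features for qubit in range(n_features)]
            for layer in range(n_layers)]


# Assumption: 4 input features (as in CartPole) on 4 qubits, 4 encoding layers.
for layer, order in enumerate(cyclic_encoding_order(4, 4)):
    print(f"layer {layer}: qubit -> feature {order}")
```

With plain data re-uploading every layer would use the same assignment; with the cyclic variant, each qubit encodes every feature once after as many layers as there are features.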

VII Acknowledgement

We thank Dr. Georgios Kontes (Fraunhofer IIS, Nürnberg, Germany), Dr. Steffen Udluft (Siemens AG Technology, Munich, Germany), and Dr. Daniel Hein (Siemens AG Technology, Munich, Germany) for helpful discussions about reinforcement learning. This work was supported by the German Federal Ministry of Education and Research (BMBF), funding program “quantum technologies from basic research to market”, grant number 13N15645.

References

[1] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[2] A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R Ruiz, J. Schrittwieser, G. Swirszcz et al., “Discovering faster matrix multiplication algorithms with reinforcement learning,” Nature, vol. 610, no. 7930, pp. 47–53, 2022.
[3] R. Agarwal, D. Schuurmans, and M. Norouzi, “Striving for simplicity in off-policy deep reinforcement learning,” 2020. [Online]. Available: https://openreview.net/forum?id=ryeUg0VFwr
[4] O. Kiss, M. Grossi, P. Lougovski, F. Sanchez, S. Vallecorsa, and T. Papenbrock, “Quantum computing of the ⁶Li nucleus via ordered unitary coupled clusters,” Physical Review C, vol. 106, no. 3, p. 034325, 2022.
[5] P. J. O’Malley, R. Babbush, I. D. Kivlichan, J. Romero, J. R. McClean, R. Barends, J. Kelly, P. Roushan, A. Tranter, N. Ding et al., “Scalable quantum simulation of molecular energies,” Physical Review X, vol. 6, no. 3, p. 031007, 2016.
[6] W. J. Yun, J. P. Kim, S. Jung, J.-H. Kim, and J. Kim, “Quantum multi-agent actor-critic neural networks for internet-connected multi-robot coordination in smart factory management,” IEEE Internet of Things Journal, 2023.
[7] M. C. Caro, H. Huang, M. Cerezo et al., “Generalization in quantum machine learning from few training data,” Nature Communications, vol. 13, p. 4919, 2022.
[8] N. Meyer, C. Ufrecht, M. Periyasamy, D. D. Scherer, A. Plinge, and C. Mutschler, “A survey on quantum reinforcement learning,” arXiv preprint arXiv:2211.03464, 2022.
[9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” CoRR, vol. abs/1606.01540, 2016. [Online]. Available: http://arxiv.org/abs/1606.01540
[10] S. Fujimoto, E. Conti, M. Ghavamzadeh, and J. Pineau, “Benchmarking batch deep reinforcement learning algorithms,” arXiv preprint arXiv:1910.01708, 2019.
[11] M. Van Otterlo and M. Wiering, “Reinforcement learning and markov decision processes,” Reinforcement learning: State-of-the-art, pp. 3–42, 2012.
[12] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[14] M. Schuld, A. Bocharov, K. M. Svore, and N. Wiebe, “Circuit-centric quantum classifiers,” Physical Review A, vol. 101, no. 3, p. 032308, 2020.
[15] S. Y.-C. Chen, C.-M. Huang, C.-W. Hsing, and Y.-J. Kao, “Hybrid quantum-classical classifier based on tensor network and variational quantum circuit,” arXiv preprint arXiv:2011.14651, 2020.
[16] A. Blance and M. Spannowsky, “Quantum machine learning for particle physics using a variational quantum classifier,” Journal of High Energy Physics, vol. 2021, no. 2, pp. 1–20, 2021.
[17] M. A. Nielsen and I. L. Chuang, Quantum computation and quantum information. Cambridge university press, 2010.
[18] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, “Quantum circuit learning,” Physical Review A, vol. 98, no. 3, p. 032309, 2018.
[19] M. Schuld, R. Sweke, and J. J. Meyer, “Effect of data encoding on the expressive power of variational quantum-machine-learning models,” Physical Review A, vol. 103, no. 3, p. 032430, 2021.
[20] F. J. Gil Vidal and D. O. Theis, “Input redundancy for parameterized quantum circuits,” Frontiers in Physics, vol. 8, p. 297, 2020.
[21] A. Skolik, S. Jerbi, and V. Dunjko, “Quantum agents in the gym: a variational quantum algorithm for deep q-learning,” Quantum, vol. 6, p. 720, 2022.
[22] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, “Evaluating analytic gradients on quantum hardware,” Phys. Rev. A, vol. 99, p. 032331, Mar 2019. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.99.032331
[23] A. Pellow-Jarman, I. Sinayskiy, A. Pillay, and F. Petruccione, “A comparison of various classical optimizers for a variational quantum linear solver,” Quantum Information Processing, vol. 20, no. 6, p. 202, 2021.
[24] X. Bonet-Monroig, H. Wang, D. Vermetten, B. Senjean, C. Moussa, T. Bäck, V. Dunjko, and T. E. O’Brien, “Performance comparison of optimization methods on variational quantum algorithms,” arXiv preprint arXiv:2111.13454, 2021.
[25] I. Miháliková, M. Friák, M. Pivoluska, M. Plesch, M. Saip, and M. Šob, “Best-practice aspects of quantum-computer calculations: A case study of the hydrogen molecule,” Molecules, vol. 27, no. 3, p. 597, 2022.
[26] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
[27] M. Wiedmann, M. Hölle, M. Periyasamy, N. Meyer, C. Ufrecht, D. D. Scherer, A. Plinge, and C. Mutschler, “An empirical comparison of optimizers for quantum machine learning with spsa-based gradients,” arXiv preprint arXiv:2305.00224, 2023.
[28] D. Dong, C. Chen, and Z. Chen, “Quantum reinforcement learning,” in Advances in Natural Computation: First International Conference, ICNC 2005, Changsha, China, August 27-29, 2005, Proceedings, Part II 1. Springer, 2005, pp. 686–689.
[29] D. Dong, C. Chen, H. Li, and T.-J. Tarn, “Quantum reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, no. 5, pp. 1207–1220, 2008.
[30] L. K. Grover, “A fast quantum mechanical algorithm for database search,” in Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, 1996, pp. 212–219.
[31] S. Y.-C. Chen, C.-H. H. Yang, J. Qi, P.-Y. Chen, X. Ma, and H.-S. Goan, “Variational quantum circuits for deep reinforcement learning,” IEEE Access, vol. 8, pp. 141 007–141 024, 2020.
[32] Z. Cheng, K. Zhang, L. Shen, and D. Tao, “Offline quantum reinforcement learning in a conservative manner,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 6, 2023, pp. 7148–7156.
[33] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning. PMLR, 2019, pp. 2052–2062.
[34] S. Thrun and A. Schwartz, “Issues in using function approximation for reinforcement learning,” in Proceedings of the Fourth Connectionist Models Summer School, vol. 255. Hillsdale, NJ, 1993, p. 263.
[35] S. Levine, A. Kumar, G. Tucker, and J. Fu, “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” arXiv preprint arXiv:2005.01643, 2020.
[36] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, 2016.
[37] M. Franz, L. Wolf, M. Periyasamy, C. Ufrecht, D. D. Scherer, A. Plinge, C. Mutschler, and W. Mauerer, “Uncovering instabilities in variational-quantum deep q-networks,” Journal of The Franklin Institute, 2022.
[38] N. Meyer, D. Scherer, A. Plinge, C. Mutschler, and M. Hartmann, “Quantum policy gradient algorithm with optimized action decoding,” in International Conference on Machine Learning. PMLR, 2023, pp. 24 592–24 613.
[39] A. Pérez-Salinas, A. Cervera Lierta, E. Gil-Fuster, and J. Latorre, “Data re-uploading for a universal quantum classifier,” Quantum, vol. 4, p. 226, 02 2020.
[40] M. Periyasamy, N. Meyer, C. Ufrecht, D. D. Scherer, A. Plinge, and C. Mutschler, “Incremental data-uploading for full-quantum classification,” in 2022 IEEE International Conference on Quantum Computing and Engineering (QCE). IEEE, 2022, pp. 31–37.
[41] O. Berezniuk, A. Figalli, R. Ghigliazza, and K. Musaelian, “A scale-dependent notion of effective dimension,” arXiv preprint arXiv:2001.10872, 2020.
[42] A. Abbas, D. Sutter, C. Zoufal, A. Lucchi, A. Figalli, and S. Woerner, “The power of quantum neural networks,” Nature Computational Science, vol. 1, no. 6, pp. 403–409, 2021.
[43] J. Rissanen, “Fisher information and stochastic complexity,” IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
[44] Qiskit contributors, “Qiskit: An open-source framework for quantum computing,” 2023.
[45] K. Temme, S. Bravyi, and J. M. Gambetta, “Error mitigation for short-depth quantum circuits,” Physical Review Letters, vol. 119, no. 18, Nov. 2017. [Online]. Available: http://dx.doi.org/10.1103/PhysRevLett.119.180509