DRLQ: A Deep Reinforcement Learning-based Task Placement for Quantum Cloud Computing

Hoa T. Nguyen¹, Muhammad Usman^2,3, and Rajkumar Buyya¹ [email protected], {muhammad.usman, rbuyya}@unimelb.edu.au ¹Cloud Computing and Distributed Systems (CLOUDS) Laboratory,
School of Computing and Information Systems, The University of Melbourne, Parkville, 3052, Victoria, Australia ²School of Physics, The University of Melbourne, Parkville, 3052, Victoria, Australia ³Data61, CSIRO, Clayton, 3168, Victoria, Australia

Abstract

The quantum cloud computing paradigm presents unique challenges in task placement due to the dynamic and heterogeneous nature of quantum computation resources. Traditional heuristic approaches fall short in adapting to the rapidly evolving landscape of quantum computing. This paper proposes DRLQ, a novel Deep Reinforcement Learning (DRL)-based technique for task placement in quantum cloud computing environments, addressing the optimization of task completion time and quantum task scheduling efficiency. It leverages the Deep Q Network (DQN) architecture, enhanced with the Rainbow DQN approach, to create a dynamic task placement strategy. This approach is one of the first in the field of quantum cloud resource management, enabling adaptive learning and decision-making for quantum cloud environments and effectively optimizing task placement based on changing conditions and resource availability. We conduct extensive experiments using the QSimPy simulation toolkit to evaluate the performance of our method, demonstrating substantial improvements in task execution efficiency and a reduction in the need to reschedule quantum tasks. Our results show that utilizing the DRLQ approach for task placement can significantly reduce total quantum task completion time by 37.81% to 72.93% and prevent task rescheduling attempts compared to other heuristic approaches.

Index Terms:

quantum cloud, task placement, resource management, reinforcement learning, quantum cloud scheduling.

I Introduction

Quantum computing is at the forefront of technological innovation, with the potential to drive advances in fields such as cryptography [1], finance [2], machine learning [3], and complex chemical simulation [4]. It has the capability to solve numerous problems that are currently intractable by classical computers. Besides, the emergence of quantum cloud computing [5] represents a major advancement in providing access to the computational capabilities of quantum computing. This integration enables users worldwide to execute quantum algorithms on remotely accessible quantum computers, thereby overcoming the significant barriers of cost and physical access associated with quantum hardware [6]. The quantum cloud paradigm has not only extended the accessibility of quantum computing but has also presented new challenges and opportunities in optimizing the utilization of these cloud-based quantum computation resources.

Refer to caption — Figure 1: Overview of the system model for the task placement problem in quantum cloud environments

Despite its significant potential, the efficient utilization of quantum cloud computing resources faces substantial challenges, particularly in the context of quantum cloud resource orchestration. Specifically, quantum task placement, i.e., selecting an appropriate quantum backend or physical hardware and associated parameters for executing quantum tasks, is crucial for the performance and reliability of quantum computations [7]. However, the current landscape of quantum task placement is denoted by dependence on heuristic approaches or manually crafted policies [8]. Although practical in certain contexts, these approaches do not take full advantage of the dynamic capabilities of quantum cloud computing environments. They lack the flexibility and adaptability required to optimize performance in the face of ongoing advancements in quantum hardware and the increasing complexity of practical quantum applications. [9].

This scenario underscores the critical need for novel, adaptive techniques in resource management capable of harnessing the potential of quantum cloud computing. Integrating deep reinforcement learning (DRL) into the quantum task placement process presents a promising approach to address these challenges. By incorporating the principles of DRL, which is well-suited for navigating complex and dynamic environments such as cloud-edge in the classical domain with several successful examples such as [10, 11, 12], a DRL-based task placement strategy can potentially enhance quantum cloud resource management. As far as we know, this study is the first attempt to apply a deep reinforcement learning-based technique designed for the task placement problem in quantum cloud computing environments [5]. Our approach aims to navigate the complexities of quantum systems dynamics, balancing performance metrics and adaptively learning the optimal task placement policy through continuous interactions with the quantum computing environment. Our proposed methodology leveraging Deep Q Networks (DQN) architecture and empowered by Rainbow DQN approach [13], which combines advantages of different DQN algorithms, including Double DQN, Prioritized Replay, Multi-step learning, Distributional RL, and Noisy Nets, seeks to optimize task placement in quantum cloud computing environment.

The major contributions and novelty of our work are:

•

We propose one of the first applications of DRL techniques to address the task placement problem in quantum cloud computing, leveraging the enhanced combining improvement of the Rainbow DQN technique [13] for robust and adaptive decision-making.
•

Through extensive experimentation, we have shown that our DRLQ approaches can significantly reduce the total completion time by 37.81% to 72.93% and minimize the need for task rescheduling compared to other popular heuristic-based approaches.
•

Our findings emphasize a possible method for tackling the problem of quantum task placement in order to optimize resource management in quantum cloud computing environments. This provides a starting point for additional thorough research that considers more quantum-specific properties, such as execution accuracy, quantum circuit transpilation, and quantum error rates.

The rest of the paper is organized as follows: Section II reviews related work, identifying gaps our research addresses. Section III describes the system model and problem formulation for task placement in quantum cloud computing, then explains our DRL model, focusing on the DQN architecture and Rainbow approach. Section IV evaluates our method’s performance through simulations and discusses the implications of our findings. Section V concludes the paper by summarizing our contributions along with future work.

II Related Work

The literature on task placement in quantum cloud computing is nascent, with only few works beginning to address the unique challenges posed by this emerging paradigm. This section briefly reviews the existing studies, highlighting the pioneering efforts in task placement within quantum cloud computing and the potential of DRL to innovate in this area. We summarise several related works in the quantum cloud domain and representative works in the classical cloud-edge domain in Table I.

TABLE I: Representative works related to our study

References

Resource Management

Problem

Environment

Approach

[14]

Task Placement

Classical Cloud

DRL

[12]

Task Placement

Classical Edge

DRL

[8]

Task Placement

Quantum Cloud

Heuristics

[15]

Qubit allocation

Quantum Cloud

Heuristics

[16]

Resource Allocation

Quantum Network

Heuristics

Our work

Task Placement

Quantum Cloud

DRL

The traditional task placement strategies used in classical cloud computing cannot be directly applied to the quantum context due to differences in quantum computational tasks and resource characteristics. For example, the fundamental difference between the characteristics of a quantum task and a classical task is their information unit, i.e., quantum bits and classical bits [17]. Besides, current quantum computation processors can be characterized by different benchmarking metrics, such as circuit layer operations per seconds (CLOPS) and quantum volume (QV) [18]. Meanwhile, applying deep reinforcement learning (DRL) in optimizing such tasks represents a potential approach in similar resource management problems in the classical domain [12, 14]. It offers promising yet unexplored solutions for dynamic and efficient resource management in quantum cloud environments.

A few studies have focused on heuristic approaches to designing quantum cloud resource management policies. Ravi et al. [19] analyzed quantum job characteristics of IBM Quantum cloud systems and then proposed an adaptive job scheduling approach based on a basic statistical analysis of historical data in their subsequent work [8]. Ngoenriang et al. [20] proposed a two-stage stochastic programming technique to allocate resources for distributed quantum computing. The technique aims to minimize the deployment cost and maximize quantum resource utilization while accounting for uncertainties like quantum task demands and computation power. Kaewpuang et al. [15] proposed a new method for allocating qubits in a quantum cloud that considers the uncertainties of quantum circuit requirements and expected waiting time. The method consists of two stages: reservation and on-demand. In the reservation stage, historical data is used to determine the allocation of resources, while in the on-demand stage, actual requirements are considered. Cicconetti et al. [16] proposed a resource allocation technique for distributed quantum computing, focusing on quantum network aspects. They used the Weighted Round Robin algorithm to assign network resources based on pre-calculated traffic flow weights. They developed a network provisioning simulator for evaluation, showing trade-offs between fairness and time complexity.

However, these existing works do not consider the characteristics of heterogeneous quantum computing systems and circuit-based metrics of quantum tasks when designing the scheduling algorithm [5]. Furthermore, none of the existing works leverage machine learning-based approaches for quantum task placement problems in cloud-based environments. Our paper fills this important gap and contributes to the foundational knowledge and advancement of task placement strategies in the quantum cloud computing domain.

III System Model and Problem Formulation

III-A System Model

Figure 1 represents an overview of our system model in quantum cloud computing environments, which is derived from our studies on the QSimPy [21] toolkit for quantum cloud resource management. Since quantum applications cannot be permanently installed in a quantum computer, classical cloud resources are required to host these applications. Additionally, these applications can be made available as a service [7]. Users send task requests from their local devices to the classical cloud layer, where the quantum application is deployed. A corresponding quantum task (QTask), which comprises single or multiple quantum circuits, will be created for each incoming request. The broker (or scheduler) then makes a placement decision for each QTask based on its requirement and the current state of available quantum cloud computation resources.

Quantum Nodes: The set of available quantum computation nodes (QNodes) at a quantum data center is defined as $\mathcal{Q}={q_{1},q_{2},...,q_{m}}$ , where $m=|\mathcal{Q}|$ indicates the number of available QNodes. We assume that each QNode has a single quantum processing unit (QPU), reflecting the current state of available quantum computers. Each QNode has different properties, such as qubit number ( $q^{w}$ ), quantum volume (QV) ( $q^{v}$ ) [22], circuit layer operation per second (CLOPS) ( $q^{s}$ ) [18], supported gates ( $q^{g}$ ), and qubit topology ( $q^{t}$ ).

Quantum Tasks: We consider each quantum task (QTask) to comprise a gate-based quantum circuit. Thus, a set of incoming quantum tasks can be defined as $\Theta=\{\theta_{1},\theta_{2},...,\theta_{n}\}$ , where $n=|\Theta|$ is the number of QTasks. Each QTask $\theta_{i}$ has various properties, such as qubit number ( $\theta^{w}$ ), circuit depth ( $\theta^{d}$ ), used quantum gates ( $\theta^{g}$ ), number of shots ( $\theta^{s}$ ), qubit topology ( $\theta^{t}$ ), and arrival time ( $\theta^{a}$ ). The circuit depth ( $\theta^{d}$ ) of a quantum circuit is defined as depth-1 circuit layer, where the depth of the following components can be considered as 1: a) a single-qubit gate from native gate set, b) a measurement, c) a reset, d) a 2-qubit gate from native gate set [18].

III-B Problem Formulation

The placement configuration of QTask $\theta_{i}\in\Theta$ can be define as $\xi_{i}=\{\theta_{i},q_{j}\}$ where $q_{j}\in\mathcal{Q}$ and $1\leq j\leq|\mathcal{Q}|$ is the selected quantum node index. If the placement fails due to a violation of resource constraints, such as placing a task to a QNode that does not have enough qubits, the broker needs to do the replacement (or rescheduling) to find another suitable QNode for the task execution.

Completion Time Model: The total completion time (or the makespan) represents the total waiting time for each request from the submission to the completion, including executing time and queueing time as follows:

t_{\theta_{i}}=t^{wait}_{\theta_{i}}+t^{exec}_{\theta_{i}}

(1)

where $t^{wait}_{\theta_{i}}$ is the queuing time (from the arrival time till the execution start time) and $t^{exec}_{\theta_{i}}$ is the quantum execution time. The execution time of a quantum task depends on the corresponding computer’s required quantum circuit layer and QPU speed (CLOPS). Based on IBM Quantum study [18], we use depth-1 circuit layer operation per second (D1CPS) instead of CLOPS for $q^{s}$ , which is used to measure the number of depth-1 circuit layers (or circuit depth) of a quantum circuit that can be executed per second. The execution time of a QTask $\theta_{i}$ in QNode $q_{j}$ can be estimated as follows:

t^{exec}_{\theta_{i}}=\dfrac{\theta_{i}^{d}\times\theta_{i}^{s}}{q_{j}^{s}}

(2)

where $\theta_{i}^{d}$ is the circuit depth (or number of depth-1 circuit layers), $\theta_{i}^{s}$ is the number of shots to be executed, and $q_{j}^{s}$ is D1CPS of the QNode.

Problem Statement: Given a data center with heterogeneous quantum computation nodes (QNodes) and a continuous incoming workload (QTasks), design the task placement policy to select the most appropriate QNode for each incoming QTask object to minimize the total response time of all QTasks and mitigate the replacement frequency due to violation of the execution constraints. The objective can be defined as:

\Omega(\Xi)=\min\ \sum_{i=1}^{n}t_{\theta_{i}}

(3)

s.t.

C1:Size(\theta_{i})=1,\forall\theta_{i}\in\Theta

(4)

C2:t^{start}_{\theta_{j}}\geq t^{start}_{\theta_{h}}+t_{\theta_{h}},\forall% \theta_{h},\theta_{j}\in\Theta,h<j\leq|\Theta|

(5)

C3:\theta_{i}^{w}\leq q_{j}^{w},\forall\theta_{i}\in\Theta,q_{j}\in\mathcal{Q}

(6)

where $C1$ specifies that each QTask can only be assigned to one QNode at the time for the execution; $C2$ indicates that the QTask $\theta_{j}$ can only be executed in the selected QNode after the completion of its predecessor at the same QNode (QTask $\theta_{h}$ ); and $C3$ requires the selected QNode $q_{j}$ need to have enough qubit for the execution of QTask $\theta_{i}$ . If the QNode does not have enough qubits, the placement will fail and rescheduling will be necessary.

III-C Deep Reinforcement Learning Model

Deep Reinforcement Learning (DRL) employs deep neural networks to tackle decision-making with high-dimensional states, formulated as Markov Decision Processes (MDP) [23]. An MDP is denoted as $(\mathbb{S},\mathbb{A},\mathbb{P},\mathbb{R},\gamma)$ , with $\mathbb{S}$ and $\mathbb{A}$ representing the sets of states and actions, respectively. $\mathbb{P}$ defines the transition probabilities, $\mathbb{R}$ is the reward function, and $\gamma\in[0,1]$ is the discount factor, indicating the preference for future rewards. During discrete time steps $t$ , a DRL agent observes a state $s_{t}$ , selects an action $a_{t}$ from policy $\pi(a_{t}|s_{t})$ , and transitions to a new state $s_{t+1}$ , receiving a reward $r_{t}$ . The agent aims to maximize the expected return $\mathbb{V}^{\pi}(s_{t})=\mathbb{E}_{\pi}[\sum_{t}\gamma^{t}r_{t}],t\in\mathbb{T}$ , the sum of discounted rewards obtained by following policy $\pi$ from $s_{t}$ . The policy, often a neural network, is refined through training to optimize performance.

State Space $\mathbb{S}$ : A state of the agent’s observation from the quantum cloud computing environment, which includes 1) information on all available QNodes and 2) information on current QTasks to be placed (or scheduled).

The feature vector of $m$ quantum nodes in $\mathcal{Q}$ , each quantum node has $o$ features, at time step $t$ can be presented as:

\mathcal{F}^{\mathcal{Q}}_{t}=\{f^{q_{i}^{z}}|\forall q_{i}\in\mathcal{Q},1% \leq i\leq m,1\leq z\leq o\}

(7)

where $i$ is the index of a quantum node, and $z$ is the index of a quantum node’s feature.

The feature vector of the current quantum task $\theta_{j}$ in $\Theta$ , with $p$ features at time step $t$ can be presented as:

\mathcal{F}^{\theta_{j}}_{t}=\{f^{\theta_{j}^{k}}|\theta_{j}\in\Theta,1\leq k% \leq p\}

(8)

where $k$ is the index of the current quantum task’s feature.

All features of quantum nodes and quantum tasks are described in Section III-A. Each QTask specification is only sent to the DRLQ agent as part of the state after its arrival. Therefore, the State space of the system can be defined as:

\mathbb{S}=\{s_{t}|s_{t}=(\mathcal{F}_{t}^{\mathcal{Q}},\mathcal{F}_{t}^{% \theta_{j}}),\forall t\in\mathbb{T}\}

(9)

Action Space $\mathbb{A}$ : An action can be defined as the placement of a QTask on an available quantum node. The action $a_{t}$ at time step $t$ is an placement of QTask $\theta_{j}$ to quantum node $q_{i}$ can be defined as

a_{t}=\xi_{i}=\{q_{i},\theta_{j}\},q_{i}\in\mathcal{Q},\theta_{j}\in\Theta

(10)

Thus, the Action space is equivalent to the set of all available quantum nodes at the data center:

\mathbb{A}=\mathcal{Q}

(11)

Reward Function $\mathbb{R}$ : The main goal is to minimize the total completion of all incoming tasks. Besides, we also aim to mitigate task replacement attempts (or maximise the success rate of the task placement). To achieve these objectives, we define the reward $r_{t}$ at time step $t$ as follows:

r_{t}=\begin{cases}\dfrac{1}{t_{\theta_{i}}}\times(1-\alpha\kappa)&\text{if }% \text{done}=1\\ \Delta\times(1+\alpha\kappa)&\text{if }\text{done}=0\end{cases}

(12)

where $t_{\theta_{i}}$ is the total completion time of QTask $\theta_{i}$ , $\alpha$ is the penalty factor, and $\kappa$ is the replacement count. If the QTask is successful ( $done=1$ ), the inverse value of its total completion time is assigned for the reward to encourage the policy to find a better placement that has a shorter total completion time to get a higher reward $r_{t}$ . Otherwise, if the task execution fails for any reason, we apply a large negative value $\Delta$ for penalty and advise the policy to avoid similar action in the future. Besides, we also consider $\kappa$ - the number of replacements (or rescheduling attempts) of QTask $\theta_{i}$ and define a penalty factor $\alpha$ when assigning the reward in order to mitigate the replacement. The penalty factor acts as an additional discount factor when a QTask needs more than one placement to be successful and also magnifies the penalty if that QTask fails multiple times. Thus, our reward function can be used to achieve the main objective of minimizing the total completion time and reducing the number of task replacements.

III-D DRLQ Framework

Our DRLQ framework employs an enhanced deep reinforcement learning technique, combining Deep Q-Networks (DQN) and the Rainbow approach [13] to optimize task placement in quantum cloud computing environments.

The Deep Q-Network (DQN) algorithm, introduced by Mnih et al. [24, 23], represents a significant advancement in reinforcement learning, utilizing deep neural networks to approximate the action-value function $Q(s,a;\theta)$ . The objective is to minimize the loss function:

L(\theta)=\mathbb{E}\left[\left(r+\gamma\max_{a^{\prime}}Q(s^{\prime},a^{% \prime};\theta^{-})-Q(s,a;\theta)\right)^{2}\right]

(13)

where $y_{i}=r+\gamma\max_{a^{\prime}}Q(s^{\prime},a^{\prime};\theta^{-})$ defines the target for a state-action pair $(s,a)$ , with $r$ as the immediate reward, $\gamma$ as the discount factor, $s^{\prime}$ as the subsequent state, and $\theta^{-}$ denotes the parameters of a target network that is periodically updated to stabilize the learning process. The overall process of the DRLQ framework can be represented in the Algorithm 1.

1 Initialize the quantum cloud environment in QSimPy;

2 Loading dataset for the QTask generator module;

3 Register the environment with Ray;

4 Initialize replay buffer

\mathcal{D}

to capacity

N

and priority replay configuration;

5 Initialize action-value policy

Q

with random weights;

6 for each hyperparameter configuration do

7 Set hyperparameters using Ray Tune;

8 for episode $=1,M$ do

9 Perform DQN process, defined in [24];

10 Adjust rewards and transitions for

n

-step;

11 Add parameter noise to network weights for exploration;

12 Update

Q

using a distributional approach;

14 end for

16 end for

17Select the best configuration from Ray Tune results;

Algorithm 1 DRLQ Framework for QTask Placement

First, we initialize the quantum cloud environment following the reinforcement learning setting. Due to the limitation of managing the practical quantum cloud environment setup, we utilize a simulated quantum cloud environment by utilizing the QSimPy simulator [21] and a quantum application dataset, such as MQTBench [25], for generating the synthetic QTask data for the environment. Then, we register the environment with Ray [26], a comprehensive machine-learning framework for managing the training and tuning. We utilize a replay buffer to enhance learning efficiency by decoupling consecutive training samples, thereby providing a diverse set of experiences for more robust neural network training. The replay buffer configuration and other training hyperparameters need to be defined for the hyperparameter tuning process. We leverage the Rainbow DQN approaches [13] to combine all the advantages of the different DQN approaches, such as Multi-step Learning [27], Distributional RL [28], Prioritized Replay [29], and Noisy Nets [30]. These enhancements of DQN can collectively improve the training efficiency and effectiveness in quantum task placement, offering a potential approach to learning in the complex and dynamic environments of quantum cloud computing. Finally, we determine the best hyperparameter configuration based on the tuning process using Ray Tune [31].

IV Performance Evaluation

IV-A Environment Setup

To evaluate the performance of our DRLQ technique, we set up a simulated environment that reflects the actual quantum cloud environment due to current limitations on accessing and managing a quantum data center. We use QSimPy [21], a learning-centric framework derived from the iQuantum toolkit [32], to create the quantum cloud environment. All experiments are conducted on a computation instance with 16 vCPUs and 64GB of RAM at the Melbourne Research Cloud.

We model a quantum data center with 10 heterogeneous quantum nodes, ranging from 16 qubits to 127 qubits, using quantum benchmarking metrics [18] from IBM Quantum and backend instances in Qiskit. The modeled quantum nodes included ibm_sherbrooke, ibm_washington, ibm_brisbane, ibm_osaka, ibm_nazca, ibm_kyoto, ibm_cusco, ibm_kolkata, ibm_hanoi, ibm_guadalupe. We used Qiskit [33] for the transpilation of circuits in incoming QTasks to the selected QNode and extracted the circuit metrics after transpilation to mimic the process when a quantum circuit reaches the quantum node for further execution. To simulate stochastic incoming quantum tasks with metrics from actual quantum applications, we selected 12 quantum applications from the MQTBench dataset [25], which contains quantum circuits in QASM files ranging from 2 to 50 qubits each. The selected quantum applications from the MQTBench dataset include:

1.

Amplitude Estimation (AE)
2.

Deutsch-Jozsa algorithm
3.

Greenberger–Horne–Zeilinger (GHZ) state
4.

Quantum Fourier Transformation
5.

Entangled Quantum Fourier Transformation
6.

Quantum Neural Network (QNN)
7.

Quantum Phase Estimation (QPE) exact
8.

Quantum Phase Estimation (QPE) inexact
9.

Random circuits
10.

Real Amplitudes ansatz with Random Parameters
11.

Efficient SU2 ansatz with Random Parameters
12.

Two Local ansatz with random parameters

The number of shots is set to 1024 by default. We randomly select QTasks for each episode in the reinforcement learning process from the synthetic QTask dataset. The QTask arrival times were generated following a Poisson distribution.

After setting up the Gymnasium-based environment for the quantum cloud, we used Ray RLlib [34], an industry-grade reinforcement learning framework, to implement the proposed method of DRLQ and several baseline algorithms for performance comparison. We evaluated the performance of DRLQ against other popular heuristic approaches, including:

•

Greedy: QTasks are greedily assigned to the QNode with the shortest waiting time, similar to an approach in [7]. If these tasks fail, they are assigned to the most powerful QNode (i.e., the one with the largest number of qubits).
•

Round Robin: QTasks are assigned to QNodes in a cyclic order, ensuring a balanced distribution of tasks across all available QNodes.
•

Random: QTasks are randomly assigned to QNodes.

IV-B Evaluation and Discussion

We used a similar evaluation approach following other DRL-based task scheduling works in classical computing, such as [35, 12], to evaluate the performance of our framework using reward values after 100 training iterations, which involves 100,000 time steps. We conducted extensive experiments and used Ray Tune [31] to optimize hyperparameters through the grid search method. The results of the best configuration of DRLQ are shown in Figure 2.

Our main objective is to minimize the total completion time and the number of task rescheduling (or replacement) attempts. Figure 2 clearly demonstrates the efficient learning process of the DRLQ method. As the reward is inversely proportional to the total completion time, the upward trend in the reward indicates a reduction in total completion time during the training episodes. The reward continuously increases after the first 20 training iterations, reaching convergence after 90 training iterations (each iteration consists of 1,000 time steps). Simultaneously, the episode length significantly reduces and converges around 60, which is the minimum length of an episode, after the first 25 training iterations. We then exported the trained policy after 100 training iterations for evaluation on different QTask workload datasets to compare the effectiveness of DRLQ against other heuristic approaches.

Figure 3(a) compares DRLQ with other baseline techniques for QTask placements, considering the total completion time of all QTasks over 100 episodes, each consisting of 60 random incoming QTasks different from the training set of DRLQ. The average total completion times of all QTasks in each episode after 100 evaluation episodes across DRLQ and other baselines are shown in Figure 3(b). Our evaluation results indicate that the DRLQ algorithm significantly improves efficiency by minimizing the total completion time. It achieves a 37.81% reduction in the total completion time compared to the Greedy algorithm, a 72.93% reduction compared to the Random approach, and a 70.71% reduction compared to the Round Robin algorithm over 100 evaluation episodes.

We also evaluate the performance of DRLQ by considering the number of task rescheduling attempts per episode. Figure 4 shows the average number of task rescheduling attempts over 100 evaluation episodes for all approaches.

The results show that our DRLQ approach significantly outperforms other methods in mitigating task rescheduling attempts. Specifically, DRLQ achieves zero rescheduling attempts, which is a substantial improvement over the Greedy, Random, and Round Robin approaches, with average rescheduling attempts of 26.95, 7.88, and 8.11, respectively. These improvements are especially important in quantum cloud computing environments, where minimizing task completion time is crucial due to the high costs of quantum resources and the inherent variability of quantum computations.

V Conclusions and Future Work

In this study, we have explored the effectiveness and potential of utilizing a deep reinforcement learning approach in proposing a novel DRLQ framework for task placement in quantum cloud computing environments. Our results showcase the significant improvement of the DRLQ technique in quantum task placement compared to other heuristic approaches. It highlights the potential of using a DRL-based approach as a robust quantum cloud resource management. Our work is one of the first studies on the resource management problem of quantum cloud computing, which requests more attention from the research community in the quantum cloud domain.

Our future research directions will focus on several aspects to improve the potential of this approach. We are leveraging more advanced deep reinforcement learning techniques to improve performance in diverse quantum computing contexts. Besides, we aim to evaluate DRLQ in real quantum cloud systems to verify its practicality and fine-tune its application. We are also considering other properties of NISQ devices, such as error rates and quantum coherence, to explore the potential impacts of quantum mechanics on computational outcomes.

Acknowledgments

The research is partially supported by the University of Melbourne through an ARC Discovery Project (awarded to Prof. Buyya). Hoa Nguyen acknowledges the support from the Science and Technology Scholarship Program for Overseas Study for Master’s and Doctoral Degrees, Vingroup, Vietnam.

References

[1] S. Pirandola, U. L. Andersen, L. Banchi, M. Berta, D. Bunandar, R. Colbeck, D. Englund, T. Gehring, C. Lupo, C. Ottaviani, J. L. Pereira, M. Razavi, J. S. Shaari, M. Tomamichel, V. C. Usenko, G. Vallone, P. Villoresi, and P. Wallden, “Advances in quantum cryptography,” Adv. Opt. Photon., vol. 12, pp. 1012–1236, Dec 2020.
[2] D. J. Egger, C. Gambella, J. Marecek, S. McFaddin, M. Mevissen, R. Raymond, A. Simonetto, S. Woerner, and E. Yndurain, “Quantum computing for finance: State-of-the-art and future prospects,” IEEE Transactions on Quantum Engineering, vol. 1, pp. 1–24, 2020.
[3] M. T. West, S.-L. Tsang, J. S. Low, C. D. Hill, C. Leckie, L. C. L. Hollenberg, S. M. Erfani, and M. Usman, “Towards quantum enhanced adversarial robustness in machine learning,” Nature Machine Intelligence, vol. 5, p. 581–589, May 2023.
[4] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, “Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets,” Nature, vol. 549, p. 242–246, Sep 2017.
[5] H. T. Nguyen, P. Krishnan, D. Krishnaswamy, M. Usman, and R. Buyya, “Quantum Cloud Computing: A Review, Open Problems, and Future Directions,” 2024. arXiv:2404.11420.
[6] M. Kaiiali, S. Sezer, and A. Khalid, “Cloud computing in the quantum era,” in Proceedings of the 2019 IEEE Conference on Communications and Network Security (CNS), (Washington, DC, USA), pp. 1–4, 2019.
[7] H. T. Nguyen, M. Usman, and R. Buyya, “QFaaS: A Serverless Function-as-a-Service framework for Quantum computing,” Future Generation Computer Systems, vol. 154, p. 281–300, May 2024.
[8] G. S. Ravi, Smith, and F. T. Chong, “Adaptive job and resource management for the growing quantum cloud,” in Proceedings of the 2021 IEEE International Conference on Quantum Computing and Engineering (QCE), (Broomfield, CO, USA), pp. 301–312, IEEE, Oct 2021.
[9] A. M. Dalzell, S. McArdle, M. Berta, P. Bienias, C.-F. Chen, A. Gilyén, C. T. Hann, M. J. Kastoryano, E. T. Khabiboulline, A. Kubica, G. Salton, S. Wang, and F. G. S. L. Brandão, “Quantum algorithms: A survey of applications and end-to-end complexities,” 2023. arXiv:2310.03011.
[10] H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-Learning,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI’16, (Phoenix, Arizona), p. 2094–2100, AAAI Press, 2016.
[11] D. Yi, X. Zhou, Y. Wen, and R. Tan, “Efficient compute-intensive job allocation in data centers via deep reinforcement learning,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 6, pp. 1474–1485, 2020.
[12] M. Goudarzi, M. Palaniswami, and R. Buyya, “A distributed deep reinforcement learning technique for application placement in edge and fog computing environments,” IEEE Transactions on Mobile Computing, vol. 22, no. 5, pp. 2491–2505, 2023.
[13] M. Hessel, J. Modayil, H. Van Hasselt, Schaul, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining Improvements in Deep Reinforcement Learning,” vol. 32, (New Orleans, Louisiana, USA), AAAI Press, Apr 2018.
[14] Z. Chen, J. Hu, G. Min, C. Luo, and T. El-Ghazawi, “Adaptive and efficient resource allocation in cloud datacenters using actor-critic deep reinforcement learning,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 8, pp. 1911–1923, 2022.
[15] R. Kaewpuang, M. Xu, D. Niyato, H. Yu, Z. Xiong, and J. Kang, “Stochastic qubit resource allocation for quantum cloud computing,” in Proceedings of the NOMS 2023-2023 IEEE/IFIP Network Operations and Management Symposium, (Miami, FL, USA), pp. 1–5, 2023.
[16] C. Cicconetti, M. Conti, and A. Passarella, “Resource allocation in quantum networks for distributed quantum computing,” in 2022 IEEE International Conference on Smart Computing (SMARTCOMP), (Helsinki, Finland), pp. 124–132, IEEE, 2022.
[17] F. Leymann, J. Barzen, M. Falkenthal, D. Vietz, B. Weder, and K. Wild, “Quantum in the Cloud: Application Potentials and Research Opportunities,” in Proceedings of the 10th International Conference on Cloud Computing and Service Science (CLOSER 2020), pp. 9–24, SciTePress, May 2020.
[18] A. Wack, H. Paik, A. Javadi-Abhari, P. Jurcevic, I. Faro, J. M. Gambetta, and B. R. Johnson, “Quality, speed, and scale: three key attributes to measure the performance of near-term quantum computers,” 2021. arXiv:2110.14108.
[19] G. Ravi, K. N. Smith, P. Gokhale, and F. T. Chong, “Quantum computing in the cloud: Analyzing job and machine characteristics,” in Proceedings of the 2021 IEEE International Symposium on Workload Characterization (IISWC), (Los Alamitos, CA, USA), pp. 39–50, IEEE Computer Society, Nov 2021.
[20] N. Ngoenriang, Xu, Niyato, Xuemin, and Shen, “Optimal Stochastic Resource Allocation for Distributed Quantum Computing,” 2022. arXiv:2210.02886.
[21] H. T. Nguyen, M. Usman, and R. Buyya, “QSimPy: A Learning-centric Simulation Framework for Quantum Cloud Resource Management,” 2024. arXiv:2405.01021.
[22] A. W. Cross, L. S. Bishop, S. Sheldon, P. D. Nation, and J. M. Gambetta, “Validating quantum computers using randomized model circuits,” Phys. Rev. A, vol. 100, p. 032328, Sep 2019.
[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, Petersen, A. Sadik, I. Antonoglou, King, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, p. 529–533, Feb 2015.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” 2013. arXiv:1312.5602.
[25] N. Quetschlich, L. Burgholzer, and R. Wille, “MQT Bench: Benchmarking software and design automation tools for quantum computing,” Quantum, vol. 7, p. 1062, Jul 2023.
[26] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Yang, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging AI applications,” in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), (Carlsbad, CA, USA), pp. 561–577, Oct 2018.
[27] K. De Asis, J. F. Hernandez-Garcia, G. Z. Holland, and R. S. Sutton, “Multi-step reinforcement learning: a unifying algorithm,” in Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI Press, 2018.
[28] M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, p. 449–458, JMLR, 2017.
[29] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, (San Juan, Puerto Rico), 2016.
[30] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg, “Noisy networks for exploration,” in Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, (Vancouver, Canada), 2018.
[31] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica, “Tune: A research platform for distributed model selection and training,” 2018. Presented at the 2018 ICML AutoML workshop.
[32] H. T. Nguyen, M. Usman, and R. Buyya, “iQuantum: A toolkit for modeling and simulation of quantum computing environments,” Software: Practice and Experience, vol. 54, no. 6, pp. 1141–1171, 2024.
[33] Qiskit contributors, “Qiskit: An open-source framework for quantum computing,” 2023. https://github.com/Qiskit/qiskit.
[34] E. Liang, R. Liaw, R. Nishihara, Moritz, K. Goldberg, J. Gonzalez, and Jordan, “RLlib: Abstractions for distributed reinforcement learning,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 3053–3062, PMLR, 10–15 Jul 2018.
[35] Y. Fan, B. Li, D. Favorite, N. Singh, T. Childers, P. Rich, W. Allcock, M. E. Papka, and Z. Lan, “Dras: Deep reinforcement learning for cluster scheduling in high performance computing,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4903–4917, 2022.