Denoising Gradient Descent in Variational Quantum Algorithms

Lars Simon
Bundesdruckerei GmbH
[email protected]

Holger Eble
Bundesdruckerei GmbH
[email protected]

Hagen-Henrik Kowalski
Bundesdruckerei GmbH
[email protected]

Manuel Radons
Bundesdruckerei GmbH
[email protected]

(March 2024)

Abstract

In this article we introduce an algorithm for mitigating the adverse effects of noise on gradient descent in variational quantum algorithms. This is accomplished by computing a regularized local classical approximation to the objective function at every gradient descent step. The computational overhead of our algorithm is entirely classical, i.e., the number of circuit evaluations is exactly the same as when carrying out gradient descent using the parameter-shift rules. We empirically demonstrate the advantages offered by our algorithm on randomized parametrized quantum circuits.

1 Introduction

In variational quantum algorithms (VQAs) a (typically gradient-based) classical optimizer is used to train a parametrized quantum circuit. While there is evidence that the presence of noise can be helpful for avoiding saddle points in VQAs [LWM ${}^{+}$ 23], noise is generally detrimental to their performance [FFR ${}^{+}$ 21], [WFC ${}^{+}$ 21]. In line with this, several techniques for mitigating the effect of noise in quantum algorithms (and VQAs in particular) have been proposed, see, e.g., [LB17], [TBG17], [CACC21], [vdBMT22], [UNP ${}^{+}$ 21], [RSPC23].

In this article we introduce an algorithm for mitigating the adverse effects of noise on gradient descent in VQAs. As is the case for the algorithm introduced in [RSPC23], the error-mitigating techniques are applied in real time, i.e., during execution of gradient descent. The idea is, roughly speaking, to compute an approximation to the objective function at every gradient descent step. Computation of such approximations is facilitated by the fact that the set of possible objective functions naturally embeds into a certain reproducing kernel Hilbert space, whose structure can be exploited. Unsurprisingly, the approximation will in general not be good on the entire parameter space; otherwise we could dispense with the quantum device and simply work with the classical approximation instead. However, at every gradient descent step, the approximation is constructed to be good locally around the current point in parameter space (in the noiseless, unregularized case it reproduces the gradient at the current point exactly), see Remark 1 below. The benefit of computing a local approximation is that samples from past iterations can be taken into account in order to make the approximation more robust to noise. Moreover, computing an approximation allows us to use regularization techniques from classical machine learning.

Our method is agnostic to the type of noise that evaluation of the objective function is subjected to; as a result, our algorithm can be seen as a general purpose method with applications in a variety of settings. However, this agnosticism comes with some disadvantages, the most obvious one being that our algorithm might fail to mitigate certain types of noise. Moreover, for specific types of noise, our algorithm is likely to be outperformed by specialized methods. The advantages and drawbacks of our algorithm are discussed in more detail in Section 4.

Gradients in VQAs are usually calculated using the so-called parameter-shift rules [MNKF18], [SBG ${}^{+}$ 19], [MBK21], [WIWL22] (although the computational cost of the latter scales very unfavourably with the number of trainable parameters, see [BWP23], [AKH ${}^{+}$ 23]). The computational overhead of our algorithm is entirely classical, i.e., the number of circuit evaluations is exactly the same as when carrying out gradient descent using the parameter-shift rules.

This article is organized as follows. In Section 2 we introduce our algorithm, which includes a description in pseudocode, see Algorithm 1. In Section 3 we analyse our algorithm experimentally on a large number of randomized parametrized quantum circuits, considering both measurement shot noise and (simulated) quantum hardware noise by using some of the fake backends (designed to mimic the behavior of IBM Quantum systems) provided by the Qiskit framework. In Section 4 we discuss the advantages and the drawbacks of our algorithm. Finally, in Section 5, we make our concluding remarks and point out some possible directions for future research.

1.1 Acknowledgement

This article was written as part of the Qu-Gov project, which was commissioned by the German Federal Ministry of Finance. The authors want to extend their gratitude to Kim Nguyen, Manfred Paeschke, Oliver Muth, Andreas Wilke, and Yvonne Ripke for their continuous encouragement and support.

2 The Algorithm

In Section 2.1 we describe the setting for our algorithm and the problem at hand, and in Section 2.2 we describe the algorithm we developed to tackle the problem.

2.1 The Setting

We let $n$ be a positive integer and, for $\theta\in\mathbb{R}^{m}$ , consider the unitary

\displaystyle U(\theta)=C_{m+1}R_{m}(\theta_{m})C_{m}\cdots R_{2}(\theta_{2})C_{2}R_{1}(\theta_{1})C_{1}\in\mathbb{C}^{2^{n}\times 2^{n}},

where $m$ is a non-negative integer, $C_{1},\dots,C_{m+1}$ are unitaries given by $n$ -qubit quantum circuits and, for all $j\in\{1,\dots,m\}$ , the unitary $R_{j}(\theta_{j})$ is a rotation of the form

\displaystyle R_{j}(\theta_{j})=\exp\left(-i\frac{\theta_{j}}{2}G_{j}\right)\in\mathbb{C}^{2^{n}\times 2^{n}}

for some Hermitian $G_{j}\in\mathbb{C}^{2^{n}\times 2^{n}}$ whose set of eigenvalues is $\{-1,1\}$. For example, each $G_{j}$ could be a tensor product $P_{1}\otimes\cdots\otimes P_{n}\neq I^{\otimes n}$ of Pauli matrices $P_{1},\dots,P_{n}\in\{I,X,Y,Z\}$. Moreover, we consider an observable given by a Hermitian matrix $\mathcal{M}\in\mathbb{C}^{2^{n}\times 2^{n}}$. Letting $|\psi(\theta)\rangle:=U(\theta)|0\rangle^{\otimes n}$, we consider the expected value

\displaystyle f(\theta):=\langle\psi(\theta)|\mathcal{M}|\psi(\theta)\rangle,

which defines a function $f\colon\mathbb{R}^{m}\to\mathbb{R}$ . The aim is now to minimize $f$ using gradient descent. Gradients of $f$ are typically calculated using the so-called parameter-shift rules [MNKF18], [SBG ${}^{+}$ 19], [MBK21], which reduce computation of the gradient to $2m$ evaluations of $f$ . However, evaluation of $f$ is subject to both measurement shot noise and hardware quantum noise, since it involves the execution of quantum circuits on a quantum device. As a result, gradients obtained in this way are themselves noisy, which is detrimental to performance of the gradient descent algorithm.
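The reduction of the gradient to $2m$ evaluations of $f$ can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses the parameter-shift formula with shifts $\pm\pi/2$ (the same formula that appears in Section 2.2), and `toy_objective` is a hypothetical classical stand-in for a circuit expectation value in which every parameter enters with frequencies in $\{-1,0,1\}$ only.

```python
import math

def parameter_shift_gradient(f, theta):
    """Estimate the gradient of f at theta from 2 * len(theta) evaluations of f."""
    grad = []
    for j in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[j] += math.pi / 2
        minus[j] -= math.pi / 2
        grad.append((f(plus) - f(minus)) / 2)
    return grad

# Hypothetical stand-in for a circuit expectation value (assumption, not from
# the article): each coordinate appears with frequencies -1, 0, 1 only.
def toy_objective(theta):
    return math.cos(theta[0]) * math.sin(theta[1]) + 0.5 * math.cos(theta[1])

theta = [0.3, -1.2]
shift_grad = parameter_shift_gradient(toy_objective, theta)

# Analytic gradient of the toy objective, for comparison.
exact_grad = [
    -math.sin(theta[0]) * math.sin(theta[1]),
    math.cos(theta[0]) * math.cos(theta[1]) - 0.5 * math.sin(theta[1]),
]
```

For this noiseless toy function the two gradients agree up to floating-point error; on a quantum device each of the $2m$ evaluations would be noisy, and so would the resulting estimate.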

2.2 Denoised Gradient Descent Algorithm

As shown in [SSM21], we can write

\displaystyle f(\theta)=\sum_{\omega\in\{-1,0,1\}^{m}}c_{\omega}e^{i\omega^{t}\theta}\text{ for all }\theta\in\mathbb{R}^{m},

where $c_{\omega}\in\mathbb{C}$ and $c_{\omega}=\overline{c_{-\omega}}$ for all $\omega\in\{-1,0,1\}^{m}$ . We denote the set of all functions of this form as $H$ and give $H$ the structure of a reproducing kernel Hilbert space with reproducing kernel $K$ by defining

\displaystyle\langle g_{1},g_{2}\rangle_{H}=\int_{[-\pi,\pi]^{m}}g_{1}(z)g_{2}(z)\mathrm{d}z,\ g_{1},g_{2}\in H,

and

\displaystyle K(x,z)=\frac{1}{(2\pi)^{m}}\prod_{j=1}^{m}(1+2\cos(x_{j}-z_{j})),\ x,z\in\mathbb{R}^{m}.

For the sake of numerical stability we will consider $\tilde{K}$ instead of $K$ in our algorithms, where the former is given by

\displaystyle\tilde{K}(x,z)=\prod_{j=1}^{m}\frac{1+2\cos(x_{j}-z_{j})}{3},\ x,z\in\mathbb{R}^{m}.

A more rigorous treatment of the above can be found in [SEKR23].
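To make the two kernels concrete, here is a small sketch that evaluates $\tilde{K}$ directly from its product formula and recovers $K$ from it via the constant factor $3^{m}/(2\pi)^{m}$. The function names are chosen for illustration only.

```python
import math

def k_tilde(x, z):
    """Normalized kernel: product over j of (1 + 2*cos(x_j - z_j)) / 3."""
    return math.prod((1.0 + 2.0 * math.cos(a - b)) / 3.0 for a, b in zip(x, z))

def k_reproducing(x, z):
    """Kernel K, which differs from k_tilde by the factor 3**m / (2*pi)**m."""
    m = len(x)
    return k_tilde(x, z) * 3.0 ** m / (2.0 * math.pi) ** m

x = [0.2, 1.5, -0.7]
z = [1.0, -0.4, 2.3]
value = k_tilde(x, z)
```

Each factor of $\tilde{K}$ lies in $[-1/3,1]$ and $\tilde{K}(x,x)=1$, so the entries of the kernel matrix stay bounded by $1$ in absolute value regardless of $m$, whereas the diagonal of $K$ scales like $(3/(2\pi))^{m}$.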

We will now describe the algorithm. As is the case for the gradient descent algorithm, we start out with an initial point $\theta_{0}\in\mathbb{R}^{m}$ , a learning rate $\alpha>0$ , and a number of steps $T\in\mathbb{Z}_{\geq 1}$ . In addition, we choose a regularization hyperparameter $\lambda>0$ and, optionally, an integer $\ell\geq 1$ for bounding the size of the linear systems of equations we will have to solve. In practice we may, of course, change $\alpha$ , $\lambda$ , and $\ell$ from iteration to iteration and combine our algorithm with, for example, normalized gradient descent [HLSS15], Nesterov’s Accelerated Gradient method [Nes83], or the ADAM optimizer [KB15], but we will not discuss these options here for the sake of simplicity (for a treatment of these in the context of variational quantum algorithms, see [SYRY21]).

Now assume that $t\in\mathbb{Z}$ , $1\leq t\leq T$ and that we have already carried out $t-1$ steps, which yielded points $\theta_{0},\dots,\theta_{t-1}\in\mathbb{R}^{m}$ in parameter space, and, for $s\in\{0,1,\dots,t-2\}$ , (noisy) samples

\displaystyle g^{(s)}_{j,\nu}\approx f\left(\theta_{s}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right),\ j\in\{1,\dots,m\},\nu\in\{-1,1\},

obtained from evaluating $f\left(\theta_{s}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right)$ on a quantum device (note that $\{0,1,\dots,t-2\}$ is the empty set when $t=1$, i.e., in this case we merely start out with $\theta_{0}$). Here, $\mathbf{e}_{j}\in\mathbb{R}^{m}$ denotes the $j^{\text{th}}$ canonical basis vector. We now start the $t^{\text{th}}$ step of the algorithm by first obtaining (noisy) samples $g^{(t-1)}_{j,\nu}\approx f\left(\theta_{t-1}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right)$, $j\in\{1,\dots,m\}$, $\nu\in\{-1,1\}$, through evaluation on a quantum device. Rather than using these samples to get a noisy estimate for the gradient $\nabla f(\theta_{t-1})$ by way of the parameter-shift rules, we compute a classical approximation $\tilde{f}_{t}$ of $f$ and then obtain $\theta_{t}$ by carrying out a gradient descent step with respect to $\tilde{f}_{t}$. Unsurprisingly, the approximation will in general not be good on the entire parameter space $\mathbb{R}^{m}$; otherwise we could dispense with the quantum device and simply work with the classical approximation instead. However, the approximation is constructed to provide a good estimate for the gradient of $f$ at the point $\theta_{t-1}$ (in the noiseless, unregularized case the gradient is recovered exactly), see Remark 1.

In order to construct the approximation $\tilde{f}_{t}$ , we will make use of the (noisy) samples of $f$ we have obtained so far. For ease of notation, we denote the collection of points $\left(\theta_{s}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right)_{s,j,\nu}$ as $p^{(t)}_{1},\dots,p^{(t)}_{D_{t}}$ and the corresponding noisy samples $\left(g^{(s)}_{j,\nu}\right)_{s,j,\nu}$ as $v^{(t)}_{1},\dots,v^{(t)}_{D_{t}}$ (the order in which we list the points does not have an impact on the function $\tilde{f}_{t}$ we are constructing). Here, $j\in\{1,\dots,m\}$ , $\nu\in\{-1,1\}$ , and $s$ ranges from $0$ to $t-1$ if the optional input $\ell$ was not provided. If the optional input $\ell$ was provided, we will discard some of the older samples in order to ensure that it does not become infeasible to compute the approximation $\tilde{f}_{t}$ . In this case, $s$ ranges from $\max\{0,t-\ell\}$ to $t-1$ . We now consider the linear system of equations

\displaystyle\left(\left(\tilde{K}(p^{(t)}_{k},p^{(t)}_{l})\right)_{1\leq k,l\leq D_{t}}+\lambda I_{D_{t}\times D_{t}}\right)\cdot\eta=\begin{pmatrix}v^{(t)}_{1}\\ \vdots\\ v^{(t)}_{D_{t}}\end{pmatrix},\text{ where }\eta\in\mathbb{R}^{D_{t}},

where $I_{D_{t}\times D_{t}}$ denotes the ${D_{t}\times D_{t}}$ identity matrix. Since $\tilde{K}$ only differs from the symmetric positive semidefinite kernel $K$ by multiplication with a positive constant, the matrix appearing in this linear system of equations is strictly positive definite (recall that $\lambda>0$) and hence invertible. This implies that there exists a uniquely determined solution $\eta^{(t)}\in\mathbb{R}^{D_{t}}$. We now define $\tilde{f}_{t}\colon\mathbb{R}^{m}\to\mathbb{R}$ by

\displaystyle\tilde{f}_{t}(\theta)=\sum_{k=1}^{D_{t}}\eta^{(t)}_{k}\tilde{K}(p^{(t)}_{k},\theta)\text{ for all }\theta\in\mathbb{R}^{m}.
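The construction of $\tilde{f}_{t}$ from points and sample values can be sketched as follows. This is an illustration under stated assumptions: it uses only the Python standard library, so the linear system is solved with a small hand-written Gaussian elimination (in practice one would likely use a linear algebra library), and the sample values are made up rather than obtained from a quantum device.

```python
import math

def k_tilde(x, z):
    """Normalized kernel: product over j of (1 + 2*cos(x_j - z_j)) / 3."""
    return math.prod((1.0 + 2.0 * math.cos(a - b)) / 3.0 for a, b in zip(x, z))

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_local_model(points, values, lam):
    """Return (eta, model), where eta solves (K~ + lam * I) eta = values."""
    d = len(points)
    system = [[k_tilde(points[k], points[l]) + (lam if k == l else 0.0)
               for l in range(d)] for k in range(d)]
    eta = solve_linear_system(system, values)

    def model(theta):
        return sum(e * k_tilde(p, theta) for e, p in zip(eta, points))

    return eta, model

# Hypothetical data for m = 2: the four shifted points around one iterate,
# with made-up noisy sample values (assumptions for illustration).
theta_prev = [0.4, 1.1]
half_pi = math.pi / 2
points = [
    [theta_prev[0] + half_pi, theta_prev[1]],
    [theta_prev[0] - half_pi, theta_prev[1]],
    [theta_prev[0], theta_prev[1] + half_pi],
    [theta_prev[0], theta_prev[1] - half_pi],
]
values = [-0.31, 0.52, 0.18, -0.07]
lam = 0.05
eta, model = fit_local_model(points, values, lam)
```

Because the kernel matrix is symmetric, the fitted function satisfies $\tilde{f}_{t}(p_{k})=v_{k}-\lambda\eta_{k}$ at every sample point, which makes the effect of the regularization on the fit directly visible.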

Note that $\tilde{f}_{t}$ can be evaluated on a classical device. Since $\tilde{f}_{t}\in H$ , the parameter-shift rules apply (see Lemma 10 in [SEKR23]), and hence we get, for all $j\in\{1,\dots,m\}$ :

\displaystyle\partial_{j}\tilde{f}_{t}(\theta_{t-1})=\frac{\tilde{f}_{t}\left(\theta_{t-1}+\frac{\pi}{2}\mathbf{e}_{j}\right)-\tilde{f}_{t}\left(\theta_{t-1}-\frac{\pi}{2}\mathbf{e}_{j}\right)}{2},

which allows us to compute $\nabla\tilde{f}_{t}(\theta_{t-1})$ . Alternatively, $\nabla\tilde{f}_{t}(\theta_{t-1})$ can be computed using the explicit expression for $\tilde{K}$ given above. We now obtain $\theta_{t}$ as

\displaystyle\theta_{t}=\theta_{t-1}-\alpha\cdot\nabla\tilde{f}_{t}(\theta_{t-1}).
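The two ways of computing $\nabla\tilde{f}_{t}(\theta_{t-1})$ mentioned above can be compared in a short sketch. Differentiating the product formula for $\tilde{K}$ with respect to the $j^{\text{th}}$ coordinate of its second argument replaces the $j^{\text{th}}$ factor by $\frac{2}{3}\sin(p_{j}-\theta_{j})$. The coefficients `eta` and the points below are arbitrary hypothetical values, not the solution of an actual linear system.

```python
import math

def k_tilde(x, z):
    """Normalized kernel: product over j of (1 + 2*cos(x_j - z_j)) / 3."""
    return math.prod((1.0 + 2.0 * math.cos(a - b)) / 3.0 for a, b in zip(x, z))

def grad_k_tilde(p, theta):
    """Gradient of theta -> k_tilde(p, theta), from the product formula."""
    factors = [(1.0 + 2.0 * math.cos(a - b)) / 3.0 for a, b in zip(p, theta)]
    grad = []
    for j in range(len(theta)):
        term = 2.0 * math.sin(p[j] - theta[j]) / 3.0
        for i, factor in enumerate(factors):
            if i != j:
                term *= factor
        grad.append(term)
    return grad

def model_gradient_explicit(eta, points, theta):
    """Gradient of sum_k eta_k * k_tilde(p_k, .) at theta via grad_k_tilde."""
    grad = [0.0] * len(theta)
    for e, p in zip(eta, points):
        for j, g in enumerate(grad_k_tilde(p, theta)):
            grad[j] += e * g
    return grad

def model_gradient_shift(eta, points, theta):
    """Same gradient, computed with the parameter-shift formula."""
    def model(x):
        return sum(e * k_tilde(p, x) for e, p in zip(eta, points))
    grad = []
    for j in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[j] += math.pi / 2
        minus[j] -= math.pi / 2
        grad.append((model(plus) - model(minus)) / 2)
    return grad

# Hypothetical coefficients and points (assumptions for illustration only).
eta = [0.7, -0.2, 0.4]
points = [[0.1, 2.0], [1.3, -0.5], [-2.2, 0.9]]
theta_prev = [0.4, 1.1]
alpha = 0.1

grad_explicit = model_gradient_explicit(eta, points, theta_prev)
grad_shift = model_gradient_shift(eta, points, theta_prev)
theta_next = [x - alpha * g for x, g in zip(theta_prev, grad_explicit)]
```

Both routes are purely classical; the explicit expression needs one pass over the sample points, whereas the parameter-shift route evaluates $\tilde{f}_{t}$ at $2m$ shifted points.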

The use of the regularization parameter $\lambda$ also has an impact on the length of the gradient vector $\nabla\tilde{f}_{t}(\theta_{t-1})$ and we are, arguably, more interested in denoising the direction of the estimate for the gradient vector of $f$ . Because of this, we may optionally want to force the step length to be the same as it would be when using the noisy estimate for the gradient of $f$ . One way of doing this would be to use the alternative descent step $\theta_{t}=\theta_{t-1}-\alpha_{t}\cdot\nabla\tilde{f}_{t}(\theta_{t-1})$ , where

\displaystyle\alpha_{t}=\frac{\left\|\frac{1}{2}\begin{pmatrix}g^{(t-1)}_{1,1}-g^{(t-1)}_{1,-1}\\ \vdots\\ g^{(t-1)}_{m,1}-g^{(t-1)}_{m,-1}\end{pmatrix}\right\|+\epsilon}{\left\|\nabla\tilde{f}_{t}(\theta_{t-1})\right\|+\epsilon}\cdot\alpha.

Here, $\|\cdot\|$ denotes the Euclidean norm and we introduced a small constant $\epsilon>0$ for the sake of numerical stability and in order to avoid division by $0$ . This concludes our description of the algorithm. Note that the computational overhead of our algorithm is entirely classical; our algorithm needs exactly as many circuit evaluations as gradient descent.
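The rescaled learning rate $\alpha_{t}$ is a one-line computation once the noisy samples of the current iteration and the denoised gradient are available. The sketch below uses made-up numbers; the function name and the default value of `eps` are assumptions.

```python
import math

def rescaled_learning_rate(sample_pairs, denoised_grad, alpha, eps=1e-8):
    """Compute alpha_t; sample_pairs[j] = (g_plus, g_minus) for coordinate j."""
    noisy_grad = [(g_plus - g_minus) / 2 for g_plus, g_minus in sample_pairs]
    noisy_norm = math.hypot(*noisy_grad)
    denoised_norm = math.hypot(*denoised_grad)
    return (noisy_norm + eps) / (denoised_norm + eps) * alpha

# Hypothetical values for m = 2 (assumptions for illustration).
sample_pairs = [(0.5, 0.1), (-0.2, 0.4)]   # noisy gradient is (0.2, -0.3)
denoised_grad = [0.1, -0.1]
alpha = 0.1
alpha_t = rescaled_learning_rate(sample_pairs, denoised_grad, alpha)

# Length of the resulting step versus the step of plain noisy gradient descent.
denoised_step = alpha_t * math.hypot(*denoised_grad)
noisy_step = alpha * math.hypot(0.2, -0.3)
```

Up to the small correction caused by $\epsilon$, the two step lengths coincide, so only the direction of the step is affected by the denoising.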

The intuition behind the algorithm is explained in Remark 1. A compact description of our algorithm in pseudocode can be found in Algorithm 1.

Hyperparameters: learning rate $\alpha>0$, regularization $\lambda>0$, optional: bound $\ell\in\mathbb{Z}_{\geq 1}$ on number of iterations to consider when constructing approximation, optional: $\epsilon>0$ for numerical stability when rescaling gradient

Input: initial point $\theta_{0}\in\mathbb{R}^{m}$, number of steps $T\in\mathbb{Z}_{\geq 1}$, function $f\colon\mathbb{R}^{m}\to\mathbb{R}$, $f(\theta)=\langle 0^{n}|U^{\dagger}(\theta)\mathcal{M}U(\theta)|0^{n}\rangle$ (see Section 2.1)

Output: point $\theta_{T}\in\mathbb{R}^{m}$ in parameter space

if optional hyperparameter $\ell$ not provided then
    Set $\ell:=T$
end if
for $t=1,\dots,T$ do
    for $j=1,\dots,m$ do
        Obtain (noisy) samples
            \displaystyle g^{(t-1)}_{j,1}\approx f\left(\theta_{t-1}+\frac{\pi}{2}\mathbf{e}_{j}\right),
            \displaystyle g^{(t-1)}_{j,-1}\approx f\left(\theta_{t-1}-\frac{\pi}{2}\mathbf{e}_{j}\right),
        by evaluating $f$ at the respective points using a quantum device
    end for
    Assemble the collections
        \displaystyle\left(\theta_{s}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right)_{s,j,\nu},
        \displaystyle\left(g^{(s)}_{j,\nu}\right)_{s,j,\nu}
    into tuples $\left({p^{(t)}_{1},\dots,p^{(t)}_{D_{t}}}\right)$ and $\left({v^{(t)}_{1},\dots,v^{(t)}_{D_{t}}}\right)$ respectively, in arbitrary (but matching) order, where $j\in\{1,\dots,m\}$, $\nu\in\{-1,1\}$, and $s\in\{\max\{0,t-\ell\},\dots,t-1\}$
    Find the uniquely determined $\eta^{(t)}\in\mathbb{R}^{D_{t}}$ solving the linear system of equations
        \displaystyle\left(\left(\tilde{K}(p^{(t)}_{k},p^{(t)}_{l})\right)_{1\leq k,l\leq D_{t}}+\lambda I_{D_{t}\times D_{t}}\right)\cdot\eta=\begin{pmatrix}v^{(t)}_{1}\\ \vdots\\ v^{(t)}_{D_{t}}\end{pmatrix},\text{ where }\eta\in\mathbb{R}^{D_{t}}
    Set $\tilde{f}_{t}(\theta)=\sum_{k=1}^{D_{t}}\eta^{(t)}_{k}\tilde{K}(p^{(t)}_{k},\theta)$
    Compute the gradient
        \displaystyle\nabla\tilde{f}_{t}(\theta_{t-1})=\left({\frac{\tilde{f}_{t}\left(\theta_{t-1}+\frac{\pi}{2}\mathbf{e}_{j}\right)-\tilde{f}_{t}\left(\theta_{t-1}-\frac{\pi}{2}\mathbf{e}_{j}\right)}{2}}\right)_{j=1,\dots,m}
    (alternatively, $\nabla\tilde{f}_{t}(\theta_{t-1})$ can be computed using the explicit expression for $\tilde{K}$)
    if optional hyperparameter $\epsilon$ not provided then
        Set $\alpha_{t}:=\alpha$ (in this case we do not rescale the gradient)
    else
        Set
            \displaystyle\alpha_{t}:=\frac{\left\|\frac{1}{2}\begin{pmatrix}g^{(t-1)}_{1,1}-g^{(t-1)}_{1,-1}\\ \vdots\\ g^{(t-1)}_{m,1}-g^{(t-1)}_{m,-1}\end{pmatrix}\right\|+\epsilon}{\left\|\nabla\tilde{f}_{t}(\theta_{t-1})\right\|+\epsilon}\cdot\alpha
    end if
    Set $\theta_{t}:=\theta_{t-1}-\alpha_{t}\cdot\nabla\tilde{f}_{t}(\theta_{t-1})$
end for
return $\theta_{T}$

Algorithm 1 Denoised Gradient Descent
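For readers who prefer executable code, the following is one possible transcription of Algorithm 1 into Python. It is a sketch under several assumptions and not the authors' implementation: it uses only the standard library (hence the small hand-written linear solver), the quantum device is replaced by a hypothetical classical function `toy_objective` with additive Gaussian noise, and all function names are illustrative.

```python
import math
import random

def k_tilde(x, z):
    """Normalized kernel: product over j of (1 + 2*cos(x_j - z_j)) / 3."""
    return math.prod((1.0 + 2.0 * math.cos(a - b)) / 3.0 for a, b in zip(x, z))

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def shifted(theta, j, nu):
    """Return theta + nu * (pi / 2) * e_j as a new list."""
    point = list(theta)
    point[j] += nu * math.pi / 2
    return point

def denoised_gradient_descent(noisy_f, theta0, steps, alpha, lam, ell=None, eps=None):
    """Run the denoised descent and return the iterates theta_0, ..., theta_T."""
    m = len(theta0)
    ell = steps if ell is None else ell
    thetas = [list(theta0)]
    history = []  # one (points, values) pair per iteration
    for _ in range(steps):
        theta = thetas[-1]
        new_points = [shifted(theta, j, nu) for j in range(m) for nu in (1, -1)]
        new_values = [noisy_f(p) for p in new_points]  # 2m noisy evaluations
        history.append((new_points, new_values))
        window = history[-ell:]  # samples of the last ell iterations
        points = [p for pts, _ in window for p in pts]
        values = [v for _, vals in window for v in vals]
        d = len(points)
        system = [[k_tilde(points[k], points[l]) + (lam if k == l else 0.0)
                   for l in range(d)] for k in range(d)]
        eta = solve_linear_system(system, values)

        def model(x):
            return sum(e * k_tilde(p, x) for e, p in zip(eta, points))

        grad = [(model(shifted(theta, j, 1)) - model(shifted(theta, j, -1))) / 2
                for j in range(m)]
        if eps is None:
            alpha_t = alpha  # no rescaling
        else:
            noisy_grad = [(new_values[2 * j] - new_values[2 * j + 1]) / 2
                          for j in range(m)]
            alpha_t = (math.hypot(*noisy_grad) + eps) / (math.hypot(*grad) + eps) * alpha
        thetas.append([x - alpha_t * g for x, g in zip(theta, grad)])
    return thetas

# Hypothetical noisy objective standing in for a quantum device (assumption).
def toy_objective(theta):
    return math.cos(theta[0]) * math.sin(theta[1]) + 0.5 * math.cos(theta[1])

rng = random.Random(0)

def noisy_toy_objective(theta):
    return toy_objective(theta) + rng.gauss(0.0, 0.05)

trajectory = denoised_gradient_descent(
    noisy_toy_objective, [2.0, 1.0], steps=30, alpha=0.2, lam=0.05, ell=5, eps=1e-8
)
```

The returned list contains $T+1$ points; evaluating the noiseless `toy_objective` along it gives a quick way to inspect the descent, in the same spirit as the post hoc evaluation used in Section 3.2.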

Remark 1.

In order to understand the significance of the function $\tilde{f}_{t}$, it is instructive to consider the hypothetical scenario where $\lambda$ is replaced by $0$ and where all samples we obtained are completely noiseless, i.e., coincide with the exact values of $f$ at the respective points. In this case, the above linear system of equations reduces to

\displaystyle\left(\tilde{K}(p^{(t)}_{k},p^{(t)}_{l})\right)_{1\leq k,l\leq D_{t}}\cdot\eta=\begin{pmatrix}f(p^{(t)}_{1})\\ \vdots\\ f(p^{(t)}_{D_{t}})\end{pmatrix},\text{ where }\eta\in\mathbb{R}^{D_{t}}.

As was shown in the appendix of [SEKR23] (see also Implementation Remark 6 in this reference), the linear system of equations still has a solution $\eta^{(t)}\in\mathbb{R}^{D_{t}}$ in this setting, and the function $\tilde{f}_{t}$, defined as above, agrees with $f$ on $\{p^{(t)}_{1},\dots,p^{(t)}_{D_{t}}\}$. In particular, since the parameter-shift rules apply to both $f$ and $\tilde{f}_{t}$ (both are elements of $H$), we have for all $j\in\{1,\dots,m\}$:

\displaystyle\partial_{j}f(\theta_{t-1})=\frac{f\left(\theta_{t-1}+\frac{\pi}{2}\mathbf{e}_{j}\right)-f\left(\theta_{t-1}-\frac{\pi}{2}\mathbf{e}_{j}\right)}{2}=\frac{\tilde{f}_{t}\left(\theta_{t-1}+\frac{\pi}{2}\mathbf{e}_{j}\right)-\tilde{f}_{t}\left(\theta_{t-1}-\frac{\pi}{2}\mathbf{e}_{j}\right)}{2}=\partial_{j}\tilde{f}_{t}(\theta_{t-1}).

It follows that $\nabla f(\theta_{t-1})=\nabla\tilde{f}_{t}(\theta_{t-1})$ . So, in the noiseless case without regularization hyperparameter, we exactly recover the gradient of $f$ at $\theta_{t-1}$ .
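As a sanity check of this argument, it may help to work through the smallest case by hand: $m=1$ with only the two samples of the current iteration. The symbols $v_{\pm}$ (the sample values at $\theta\pm\frac{\pi}{2}$, with $\theta=\theta_{t-1}$) and $\eta_{\pm}$ are introduced here for illustration only, and $\lambda\geq 0$ is kept general so that the effect of regularization is visible as well. The calculation uses nothing beyond the definitions of $\tilde{K}$ and $\tilde{f}_{t}$ from Section 2.2.

```latex
\begin{aligned}
&\tilde{K}\left(\theta\pm\tfrac{\pi}{2},\theta\pm\tfrac{\pi}{2}\right)=1,
\qquad
\tilde{K}\left(\theta+\tfrac{\pi}{2},\theta-\tfrac{\pi}{2}\right)=\frac{1+2\cos(\pi)}{3}=-\frac{1}{3},\\
&\begin{pmatrix}1+\lambda & -\frac{1}{3}\\ -\frac{1}{3} & 1+\lambda\end{pmatrix}
\begin{pmatrix}\eta_{+}\\ \eta_{-}\end{pmatrix}
=\begin{pmatrix}v_{+}\\ v_{-}\end{pmatrix}
\quad\Longrightarrow\quad
\eta_{+}-\eta_{-}=\frac{v_{+}-v_{-}}{\frac{4}{3}+\lambda},\\
&\tilde{f}_{t}'(\theta)
=\eta_{+}\cdot\tfrac{2}{3}\sin\left(\tfrac{\pi}{2}\right)+\eta_{-}\cdot\tfrac{2}{3}\sin\left(-\tfrac{\pi}{2}\right)
=\frac{2}{3}\left(\eta_{+}-\eta_{-}\right)
=\frac{1}{1+\frac{3}{4}\lambda}\cdot\frac{v_{+}-v_{-}}{2}.
\end{aligned}
```

For $\lambda=0$ and noiseless values $v_{\pm}=f\left(\theta\pm\frac{\pi}{2}\right)$ the right-hand side is exactly the parameter-shift expression for $f'(\theta)$, in line with the statement above. For $\lambda>0$ the estimate is shrunk by the factor $1/(1+\frac{3}{4}\lambda)$, which illustrates how the regularization affects the length of the gradient estimate when only the current iteration's samples are used.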

In the noisy case, we need to introduce a small $\lambda>0$ in order to ensure that the linear system of equations still has a solution. In this case, the function $\tilde{f}_{t}$ corresponds to a solution to the optimization problem underlying kernel ridge regression [Vov13] with kernel $\tilde{K}$ , regularization hyperparameter $\lambda$ , and data $\left((p_{k}^{(t)},v_{k}^{(t)})\right)_{k=1,\dots,D_{t}}$ . One can hope that, in the spirit of Tikhonov regularization [HK70], the regularization hyperparameter $\lambda$ helps mitigate the effect that noise (stemming from the evaluation of $f$ on a quantum device) has on the solution to the above linear system of equations. Moreover, this approach is very natural, since the true function $f$ is known to be contained in $H$ , the reproducing kernel Hilbert space associated to the kernel $K$ , which only differs from $\tilde{K}$ by multiplication with a positive constant. However, the main benefit of our method stems from the fact that (when $\ell\geq 2$ or the optional input $\ell$ was not provided), samples from past iterations are used to improve the quality of the classical approximation $\tilde{f}_{t}$ . While one would not expect this to be beneficial in the noiseless case (we recover the exact gradient using the samples from the current iteration, see above), it is plausible that, in the noisy case, the increased number of samples involved in the approximation process will reduce the effect of noise on the gradient estimate provided by our approximation. We give numerical evidence for this in Section 3; a thorough theoretical analysis is left for future work.

3 Experiments

In this section we describe our experiments with Algorithm 1. In Section 3.1 we will randomly sample parametrized quantum circuits and points in parameter space and compare the quality of the noisy gradient to that of the denoised gradient computed by Algorithm 1. Here, quality is measured in terms of cosine similarity to the exact gradient vector, computed via statevector simulation. In Section 3.2 we randomly sample a parametrized quantum circuit and an initial point in parameter space and compare the descent of the objective function between many executions of noisy gradient descent and Algorithm 1 respectively.

In all our experiments, points in parameter space were randomly sampled from the uniform distribution on $[0,2\pi)^{m}$, and we worked with the fixed observable $\mathcal{M}=Z^{\otimes n}$ and initial state $|0\rangle^{\otimes n}$. Informally speaking, we created random circuits by sandwiching random parametrized $n$-qubit Pauli rotations between random unitaries. More precisely, with the notation from Section 2.1, $G_{1},\dots,G_{m}$ were randomly sampled (independently) from the uniform distribution on $\{I,X,Y,Z\}^{\otimes n}\setminus\{I^{\otimes n}\}$. In order to keep the circuit depth manageable, the unitaries $C_{1},\dots,C_{m+1}$ were not sampled from the Haar measure on the unitary group $U(2^{n})$. Instead, they were randomly sampled (independently) as follows: For $j\in\{1,\dots,m+1\}$ we first chose a permutation $\tau_{j}\in S_{n}$ uniformly at random and subsequently (independently) sampled unitaries $U^{(j)}_{1},\dots,U^{(j)}_{\left\lfloor{n/2}\right\rfloor}$ from the Haar measure on $SU(4)$. The unitary $C_{j}$ was then obtained by applying $U^{(j)}_{1},\dots,U^{(j)}_{\left\lfloor{n/2}\right\rfloor}$ to the qubit pairs $(\tau_{j}(1),\tau_{j}(2)),\dots,(\tau_{j}(2{\left\lfloor{n/2}\right\rfloor}-1),\tau_{j}(2{\left\lfloor{n/2}\right\rfloor}))$ respectively. Note that this is precisely how the individual layers in the quantum volume test [CBS ${}^{+}$ 19] are sampled. For a visual representation of the circuits featuring in our experiments, see Figure 1.

Figure 1: This figure shows the random circuits used in the experiments in Section 3. Here, $C_{1},\dots,C_{m+1}$ are randomly sampled just like the individual layers in the quantum volume test [CBS ${}^{+}$ 19]. Moreover, $G_{1},\dots,G_{m}$ are randomly sampled non-identity Pauli strings and the measurement observable is always $\mathcal{M}=Z^{\otimes n}$. For a more rigorous description of these circuits, see the beginning of Section 3.
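The combinatorial part of this sampling procedure can be sketched with the Python standard library alone. The snippet below only draws the Pauli strings $G_{1},\dots,G_{m}$ and the qubit pairings that define the layers $C_{1},\dots,C_{m+1}$; drawing the Haar-random $SU(4)$ unitaries themselves is deliberately left out. Function names are illustrative, and qubits are indexed from $0$ rather than from $1$.

```python
import random

def sample_pauli_string(n, rng):
    """Uniform sample from {I, X, Y, Z}^n excluding the all-identity string."""
    while True:
        label = "".join(rng.choice("IXYZ") for _ in range(n))
        if label != "I" * n:  # rejection keeps the distribution uniform
            return label

def sample_layer_pairs(n, rng):
    """Random permutation of the qubits, grouped into floor(n / 2) pairs."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [(perm[2 * k], perm[2 * k + 1]) for k in range(n // 2)]

def sample_circuit_layout(n, m, seed=None):
    """Return m Pauli-string generators and m + 1 lists of qubit pairs."""
    rng = random.Random(seed)
    generators = [sample_pauli_string(n, rng) for _ in range(m)]
    layers = [sample_layer_pairs(n, rng) for _ in range(m + 1)]
    return generators, layers

generators, layers = sample_circuit_layout(n=5, m=8, seed=1)
```

For odd $n$ one qubit per layer is left without a two-qubit unitary, exactly as in the description above, where only $\left\lfloor{n/2}\right\rfloor$ pairs are formed.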

3.1 Alignment with Exact Gradient Vector

Here we describe the experiments we carried out to compare the respective alignment of the noisy gradient and the denoised gradient with the exact gradient vector. In all experiments we proceeded as follows: We decided on a number of samples $N$ and fixed $n$ , $m$ , the number of shots per circuit evaluation, and the learning rate $\alpha=0.1$ . For each combination of $\ell$ and quantum backend featuring in the experiment, we then repeated the following $N$ (number of samples) times:

1. A circuit and a point $\theta_{0}\in\mathbb{R}^{m}$ are sampled randomly as explained in the beginning of Section 3,

2. Algorithm 1 is executed for $T=\ell$ steps (including gradient rescaling), and the last (denoised) gradient computed during the algorithm (at point $\theta_{\ell-1}$) is denoted as $w_{1}\in\mathbb{R}^{m}$,

3. The (noisy) gradient $w_{2}\in\mathbb{R}^{m}$ is computed at point $\theta_{\ell-1}$ by way of the parameter-shift rules,

4. Denoting the exact gradient at $\theta_{\ell-1}$, computed via statevector simulation, as $w\in\mathbb{R}^{m}$, the cosine similarities $x_{j}=\frac{w^{t}w_{j}}{\|w\|\|w_{j}\|}$ between $w_{j}$ and $w$, where $j\in\{1,2\}$, are computed (the denominator is artificially bounded from below by a positive constant for the sake of numerical stability and in order to avoid division by zero),

5. The point $(x_{1},x_{2})\in[-1,1]\times[-1,1]$ is plotted in a coordinate system.

We then end up with a scatter plot of $N$ points. The points below the diagonal $\{x=y\}$ correspond to outcomes where the denoised gradient obtained from Algorithm 1 was closer to the exact gradient than the noisy gradient (where closeness is measured in terms of cosine similarity; this is a sensible similarity measure, since the denoised gradient was rescaled to the length of the noisy gradient).
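The comparison in steps 4 and 5 amounts to the following small computation. The lower bound `floor` on the denominator and the three vectors are assumptions chosen for illustration; the article does not state which constant was used.

```python
import math

def cosine_similarity(w, w_est, floor=1e-12):
    """Cosine similarity with the denominator bounded from below by `floor`."""
    dot = sum(a * b for a, b in zip(w, w_est))
    denom = max(math.hypot(*w) * math.hypot(*w_est), floor)
    return dot / denom

# Hypothetical gradients for m = 2 (made-up values).
exact = [1.0, 0.0]
denoised = [0.9, 0.1]
noisy = [0.5, 0.5]

x1 = cosine_similarity(exact, denoised)  # horizontal coordinate in the plot
x2 = cosine_similarity(exact, noisy)     # vertical coordinate in the plot
below_diagonal = x2 < x1                 # denoised gradient is better aligned
```

In this made-up example the point $(x_{1},x_{2})$ lies below the diagonal, i.e., it would count as a case in which the denoised gradient outperforms the noisy one.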

Note that, for each of the $N$ samples, both circuit and point in parameter space are sampled randomly (independently), i.e., there is a large number of different circuits appearing in each experiment.

We carry out two such experiments: In Section 3.1.1 we focus on the effect of measurement shot noise, whereas in Section 3.1.2 we focus on (simulated) quantum hardware noise.

3.1.1 Measurement Shot Noise

For this experiment we use $N=500$ samples and set $n=8$ , $m=8$ , $\lambda=0.28$ . Further, we set the number of measurement shots per circuit to $200$ . Since we want to exclusively focus on measurement shot noise in this experiment, we choose the AerSimulator (without noise model) provided by the Qiskit framework as the quantum backend for this experiment. We then carry out the procedure outlined in the beginning of Section 3.1 with $\ell=1,\dots,6$ , obtaining six scatter plots, see Figure 2.

For $\ell=1,\dots,6$ , the denoised gradient obtained from Algorithm 1 outperforms the noisy gradient in 50.0%, 72.8%, 82.6%, 88.6%, 91.8%, 93.0% of the cases respectively.
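To get a feeling for the size of the measurement shot noise at $200$ shots, one can use a simple classical model: for an observable with eigenvalues $\pm 1$, such as $Z^{\otimes n}$, each shot yields $+1$ with probability $(1+f(\theta))/2$ and $-1$ otherwise, and the estimate is the average over all shots. The sketch below implements this model with the standard library; it is an assumption-based illustration and not a description of how the AerSimulator works internally.

```python
import math
import random

def shot_estimate(expectation, shots, rng):
    """Average of `shots` outcomes in {+1, -1} whose mean is `expectation`."""
    p_plus = (1.0 + expectation) / 2.0
    total = sum(1 if rng.random() < p_plus else -1 for _ in range(shots))
    return total / shots

def shot_noise_std(expectation, shots):
    """Standard deviation of the estimate under this model."""
    return math.sqrt((1.0 - expectation ** 2) / shots)

rng = random.Random(42)
true_value = 0.3  # hypothetical exact expectation value
estimate = shot_estimate(true_value, 200, rng)
std_200 = shot_noise_std(true_value, 200)
```

Under this model the standard deviation at $200$ shots is $\sqrt{(1-f(\theta)^{2})/200}$, i.e., at most about $0.07$ per evaluation, which is not negligible compared to gradient components of a function taking values in $[-1,1]$.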

3.1.2 Quantum Hardware Noise

For this experiment we use $N=250$ samples and set $n=5$ , $m=8$ , $\lambda=0.04$ , and $\ell=5$ . In order to suppress the effect of measurement shot noise, we set the number of measurement shots per circuit to $10000$ . We then carry out the procedure outlined in the beginning of Section 3.1 with the (simulated) quantum hardware backends FakeVigoV2, FakeNairobiV2, FakeCairoV2, FakeBrooklynV2, FakeWashingtonV2 provided by the Qiskit framework. In order to demonstrate that measurement shot noise plays a negligible role in this experiment, we also carry out the above-mentioned procedure with the AerSimulator (without noise model) provided by the Qiskit framework. We thus obtain a total of six scatter plots, see Figure 3.

As expected, for the AerSimulator (without noise model), all noisy and denoised gradients were very close to the exact gradient. For the (simulated) quantum hardware backends FakeVigoV2, FakeNairobiV2, FakeCairoV2, FakeBrooklynV2, FakeWashingtonV2, the denoised gradient obtained from Algorithm 1 outperforms the noisy gradient in 92.4%, 91.6%, 92.4%, 89.6%, 91.6% of the cases respectively.

3.2 Descent of Objective Function

In order to keep the computational load manageable, we set $n=4$ , $m=4$ , $\ell=5$ , learning rate $\alpha=0.4$ , number of steps $T=60$ . We further set the number of measurement shots per circuit to $50$ . We then randomly sample a single circuit and initial point $\theta_{0}\in\mathbb{R}^{m}$ in parameter space as described in the beginning of Section 3. For each combination of regularization hyperparameter $\lambda$ and quantum backend featuring in this experiment, we then repeat the following $N=100$ times:

1. Execute Algorithm 1 with initial point $\theta_{0}$ (including gradient rescaling),

2. Execute noisy gradient descent with initial point $\theta_{0}$ using the parameter-shift rules with the same number of steps ($T=60$), the same learning rate ($\alpha=0.4$), and the same number of measurement shots per circuit ($50$).

For both algorithms we thus obtain $N=100$ sequences of points $(\theta_{0},\dots,\theta_{60})$ in parameter space. For the sake of better comparability we then use statevector simulation to evaluate the exact values of $f$ (see Section 2.1) at these points, which, for both algorithms, yields $N=100$ sequences of values $(f(\theta_{0}),\dots,f(\theta_{60}))$ . It is important to point out that exact evaluations of the function $f$ were not employed during the execution of either algorithm, but rather after the executions of the algorithms in order to compare the results. I.e., exact evaluations of $f$ did not have an impact on the execution of either algorithm.

For both algorithms we then computed the component-wise average of the $N=100$ sequences of values, giving precisely one sequence of values of length $T+1=61$ for each algorithm. These two sequences can be interpreted as the respective average performance of Algorithm 1 and noisy gradient descent. For the sake of comparison we then computed a corresponding sequence of $61$ points using exact gradient descent (same initial point, number of steps, and learning rate as for noisy gradient descent, but the gradients were computed via the parameter-shift rules using statevector simulation).

This procedure was carried out for each of the $9$ combinations of regularization hyperparameters $\lambda=4/\sqrt{50}$ , $\lambda=0.01/\sqrt{50}$ , and $\lambda=0.001/\sqrt{50}$ with (simulated) quantum hardware backends FakeVigoV2, FakeNairobiV2, FakeCairoV2. The results are visualized in Figure 4.

The results shown in Figure 4 support the use of our algorithm. However, as expected, there is some indication that the optimal choice for the regularization hyperparameter $\lambda$ might be device-dependent. Unsurprisingly, the results further suggest that, for optimal performance, $\lambda$ should be adjusted over the course of the algorithm (e.g., based on the length of the noisy gradient vector).

4 Discussion

In this section we will discuss both the advantages and the drawbacks of our algorithm compared to (noisy) gradient descent.

The obvious advantage is that, in many scenarios in the context of variational quantum algorithms, our denoised gradient descent algorithm is able to significantly accelerate the descent of the objective function and to improve the alignment of the estimated gradient vector with the exact gradient vector when compared to noisy gradient descent. Several scenarios where this is indeed the case were explored in Section 3. Moreover, the computational overhead of our algorithm is entirely classical – the number of circuit evaluations is exactly the same as when executing (noisy) gradient descent using the parameter-shift rules.

However, the denoised gradient descent algorithm comes with some caveats. The most obvious one is that, since our algorithm makes use of samples from past iterations, it is not really suitable for variational quantum algorithms which take data as input, i.e., for those variational quantum algorithms whose corresponding ansatz has parametrized gates corresponding to both inputs and trainable parameters. (Our algorithm would still prove beneficial if one were to carry out several consecutive gradient descent steps with the same mini-batch of training data, but this is a niche application that we will not consider here.) Instead, our algorithm is well-suited for variational quantum algorithms whose ansatz only contains parametrized gates corresponding to trainable parameters; the most prominent examples of the latter are variational quantum eigensolvers [PMS ${}^{+}$ 14].

There are also caveats regarding the performance and feasibility of the algorithm. For example, if the number of trainable parameters $m$ or the number of iterations $\ell$ to consider when computing the approximation becomes too large, it might become infeasible to solve the linear system of equations appearing in Algorithm 1: in each iteration $t\geq\ell$, it is a square linear system with $2m\ell$ equations and $2m\ell$ unknowns. Furthermore, while we expect it to be straightforward to establish a rigorous advantage in the case of measurement shot noise (under reasonable assumptions), one cannot expect our algorithm to offer a tangible advantage for all kinds of hardware quantum noise; a detailed analysis is left for future work. In this context it is also important to mention that the optimal choice of the hyperparameter $\lambda>0$ depends not only on the number of measurement shots per circuit, but also on the quantum device on which the circuits are executed. As such, good heuristics for the choice of $\lambda$ will necessarily be device-dependent. Finding good heuristics for the choice of $\lambda$ is further complicated by the fact that, for optimal performance, $\lambda$ should be adjusted over the course of the algorithm (e.g., based on the length of the noisy gradient vector).
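The size of the linear system follows directly from the range of $s$ in Algorithm 1: iteration $t$ uses the samples of $\min\{t,\ell\}$ iterations with $2m$ samples each. A small helper (the name is illustrative) makes the numbers for the experimental settings of Section 3 explicit.

```python
def system_size(m, t, ell=None):
    """Number of unknowns D_t in iteration t (ell=None means: keep all samples)."""
    window = t if ell is None else min(t, ell)
    return 2 * m * window

# Settings taken from Section 3: (m=8, ell=6) and (m=4, ell=5, T=60).
size_alignment = system_size(m=8, t=6, ell=6)
size_descent = system_size(m=4, t=60, ell=5)
size_unbounded = system_size(m=4, t=60)  # same run without the bound ell
```

Since a direct dense solve scales cubically in the number of unknowns, omitting $\ell$ makes the cost of iteration $t$ grow like $(2mt)^{3}$, which is the practical reason for bounding the window.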

Finally, we mention that the advantage offered by our algorithm might disappear if the learning rate is chosen too large. This is because the quality of the (local) approximation computed by our algorithm declines if the points at which the objective function $f$ is sampled are spaced further apart.
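The dependence of the sample spacing on the learning rate can be made concrete with a small Python sketch. It assumes the plain update rule $\theta_{t+1}=\theta_{t}-\alpha g_{t}$, so that consecutive iterates are exactly $\alpha\lVert g_{t}\rVert$ apart, and it uses a toy objective $f(\theta)=\sum_{i}\cos(\theta_{i})$ that is not taken from our experiments. All function names are hypothetical, and the sketch only measures how far apart the iterates lie; it does not compute the approximation itself.

```python
import math
from itertools import combinations


def step_length(learning_rate, gradient):
    """Euclidean length of the plain gradient descent step learning_rate * gradient."""
    return learning_rate * math.sqrt(sum(g * g for g in gradient))


def toy_gradient(theta):
    """Gradient of the toy objective f(theta) = sum_i cos(theta_i) (illustration only)."""
    return [-math.sin(t) for t in theta]


def iterates(theta0, learning_rate, num_steps):
    """Run plain gradient descent and return all visited points, including theta0."""
    points = [list(theta0)]
    for _ in range(num_steps):
        theta = points[-1]
        grad = toy_gradient(theta)
        points.append([t - learning_rate * g for t, g in zip(theta, grad)])
    return points


def spread(points):
    """Largest Euclidean distance between any two of the given points."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))


for lr in (0.01, 0.1):
    pts = iterates([1.0, 2.0], lr, num_steps=4)
    print(f"learning rate {lr}: spread of the 5 most recent iterates = {spread(pts):.3f}")
```

In this toy setting the region covered by the most recent iterates grows roughly in proportion to the learning rate, which is the regime in which a local approximation built from these iterates has to cover more ground.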

When weighing the advantages and drawbacks outlined above, we believe that there are some scenarios with practical relevance where using our algorithm would prove advantageous.

5 Conclusion and Outlook

In this article we introduced the denoised gradient descent algorithm, which mitigates the effect of noise on gradient descent in variational quantum algorithms. We explored the capabilities of the algorithm experimentally and discussed its advantages and drawbacks. Potential topics for future work include, but are not limited to, the following:

• deriving rigorous performance guarantees for the algorithm under suitable assumptions on the noise,
• thoroughly analyzing our algorithm for different types of noise channels,
• deriving good device-dependent heuristics for the choice of the hyperparameter $\lambda>0$ (including heuristics for adjusting $\lambda$ over the course of the algorithm),
• determining how large the learning rate can be chosen without the quality of the local approximation deteriorating to the point where using our algorithm is no longer advantageous,
• investigating whether our algorithm can be used to reduce the number of measurement shots necessary to successfully carry out gradient descent in variational quantum algorithms, see for example [SAPM23]. The experimental results in Section 3 seem to indicate that this is possible; however, further studies are needed to verify this.

References

[AKH ${}^{+}$ 23] Amira Abbas, Robbie King, Hsin-Yuan Huang, William J. Huggins, Ramis Movassagh, Dar Gilboa, and Jarrod Ryan McClean. On quantum backpropagation, information reuse, and cheating measurement collapse. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[BWP23] Joseph Bowles, David Wierichs, and Chae-Yeun Park. Backpropagation scaling in parameterised quantum circuits, 2023.
[CACC21] Piotr Czarnik, Andrew Arrasmith, Patrick J. Coles, and Lukasz Cincio. Error mitigation with Clifford quantum-circuit data. Quantum, 5:592, November 2021.
[CBS ${}^{+}$ 19] Andrew W. Cross, Lev S. Bishop, Sarah Sheldon, Paul D. Nation, and Jay M. Gambetta. Validating quantum computers using randomized model circuits. Phys. Rev. A, 100:032328, Sep 2019.
[FFR ${}^{+}$ 21] Enrico Fontana, Nathan Fitzpatrick, David Muñoz Ramo, Ross Duncan, and Ivan Rungger. Evaluating the noise resilience of variational quantum algorithms. Phys. Rev. A, 104:022403, Aug 2021.
[HK70] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[HLSS15] Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Neural Information Processing Systems, 2015.
[KB15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
[LB17] Ying Li and Simon C. Benjamin. Efficient variational quantum simulator incorporating active error minimization. Phys. Rev. X, 7:021050, Jun 2017.
[LWM ${}^{+}$ 23] Junyu Liu, Frederik Wilde, Antonio Anna Mele, Liang Jiang, and Jens Eisert. Stochastic noise can be helpful for variational quantum algorithms, 2023.
[MBK21] Andrea Mari, Thomas R. Bromley, and Nathan Killoran. Estimating the gradient and higher-order derivatives on quantum hardware. Physical Review A, 103(1), Jan 2021.
[MNKF18] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii. Quantum circuit learning. Physical Review A, 98(3), Sep 2018.
[Nes83] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $O(1/k^{2})$ . Doklady AN USSR, 269:543–547, 1983.
[PMS ${}^{+}$ 14] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O’Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5(1):4213, 2014.
[RSPC23] Matteo Robbiati, Alejandro Sopena, Andrea Papaluca, and Stefano Carrazza. Real-time error mitigation for variational optimization on quantum hardware, 2023.
[SAPM23] Giuseppe Scriva, Nikita Astrakhantsev, Sebastiano Pilati, and Guglielmo Mazzola. Challenges of variational quantum optimization with measurement shot noise, 2023.
[SBG ${}^{+}$ 19] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. Evaluating analytic gradients on quantum hardware. Physical Review A, 99(3), Mar 2019.
[SEKR23] Lars Simon, Holger Eble, Hagen-Henrik Kowalski, and Manuel Radons. Interpolating parametrized quantum circuits using blackbox queries, 2023.
[SSM21] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. Effect of data encoding on the expressive power of variational quantum-machine-learning models. Physical Review A, 103(3), Mar 2021.
[SYRY21] Y. Suzuki, H. Yano, R. Raymond, and N. Yamamoto. Normalized gradient descent for variational quantum algorithms. In 2021 IEEE International Conference on Quantum Computing and Engineering (QCE), pages 1–9, Los Alamitos, CA, USA, Oct 2021. IEEE Computer Society.
[TBG17] Kristan Temme, Sergey Bravyi, and Jay M. Gambetta. Error mitigation for short-depth quantum circuits. Phys. Rev. Lett., 119:180509, Nov 2017.
[UNP ${}^{+}$ 21] Miroslav Urbanek, Benjamin Nachman, Vincent R. Pascuzzi, Andre He, Christian W. Bauer, and Wibe A. de Jong. Mitigating depolarizing noise on quantum computers with noise-estimation circuits. Phys. Rev. Lett., 127:270502, Dec 2021.
[vdBMT22] Ewout van den Berg, Zlatko K. Minev, and Kristan Temme. Model-free readout-error mitigation for quantum expectation values. Phys. Rev. A, 105:032620, Mar 2022.
[Vov13] Vladimir Vovk. Kernel Ridge Regression, pages 105–116. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
[WFC ${}^{+}$ 21] Samson Wang, Enrico Fontana, M. Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, and Patrick J. Coles. Noise-induced barren plateaus in variational quantum algorithms. Nature Communications, 12(1):6961, 2021.
[WIWL22] David Wierichs, Josh Izaac, Cody Wang, and Cedric Yen-Yu Lin. General parameter-shift rules for quantum gradients. Quantum, 6:677, March 2022.