License: arXiv.org perpetual non-exclusive license
arXiv:2403.03826v1 [quant-ph] 06 Mar 2024

Denoising Gradient Descent in Variational Quantum Algorithms

Lars Simon, Holger Eble, Hagen-Henrik Kowalski, Manuel Radons
Bundesdruckerei GmbH
(March 2024)
Abstract

In this article we introduce an algorithm for mitigating the adverse effects of noise on gradient descent in variational quantum algorithms. This is accomplished by computing a regularized local classical approximation to the objective function at every gradient descent step. The computational overhead of our algorithm is entirely classical, i.e., the number of circuit evaluations is exactly the same as when carrying out gradient descent using the parameter-shift rules. We empirically demonstrate the advantages offered by our algorithm on randomized parametrized quantum circuits.

1 Introduction

In variational quantum algorithms (VQAs) a (typically gradient-based) classical optimizer is used to train a parametrized quantum circuit. While there is evidence that the presence of noise can be helpful for avoiding saddle points in VQAs [LWM+23], noise is generally detrimental to their performance [FFR+21], [WFC+21]. In line with this, several techniques for mitigating the effect of noise in quantum algorithms (and VQAs in particular) have been proposed, see, e.g., [LB17], [TBG17], [CACC21], [vdBMT22], [UNP+21], [RSPC23].

In this article we introduce an algorithm for mitigating the adverse effects of noise on gradient descent in VQAs. As is the case for the algorithm introduced in [RSPC23], the error-mitigating techniques are applied in real time, i.e., during execution of gradient descent. The idea is, roughly speaking, to compute an approximation to the objective function at every gradient descent step. Computation of such approximations is facilitated by the fact that the set of possible objective functions naturally embeds into a certain reproducing kernel Hilbert space, whose structure can be exploited. Unsurprisingly, the quality of the approximation will in general not be good on the entire parameter space; otherwise we could forgo the need for a quantum device and simply work with the classical approximation instead. However, at every gradient descent step, we are guaranteed to obtain a good approximation locally around the current point in parameter space, see Remark 1 below. The benefit of computing a local approximation is that samples from past iterations can be taken into account in order to make the approximation more robust to noise. Moreover, computing an approximation allows us to use regularization techniques from classical machine learning.

Our method is agnostic to the type of noise that evaluation of the objective function is subjected to; as a result, our algorithm can be seen as a general purpose method with applications in a variety of settings. However, this agnosticism comes with some disadvantages, the most obvious one being that our algorithm might fail to mitigate certain types of noise. Moreover, for specific types of noise, our algorithm is likely to be outperformed by specialized methods. The advantages and drawbacks of our algorithm are discussed in more detail in Section 4.

Gradients in VQAs are usually calculated using the so-called parameter-shift rules [MNKF18], [SBG+19], [MBK21], [WIWL22] (although the computational cost of these rules scales very unfavourably with the number of trainable parameters, see [BWP23], [AKH+23]). The computational overhead of our algorithm is entirely classical, i.e., the number of circuit evaluations is exactly the same as when carrying out gradient descent using the parameter-shift rules.

This article is organized as follows. In Section 2 we introduce our algorithm, which includes a description in pseudocode, see Algorithm 1. In Section 3 we analyse our algorithm experimentally on a large number of randomized parametrized quantum circuits, considering both measurement shot noise and (simulated) quantum hardware noise by using some of the fake backends (designed to mimic the behavior of IBM Quantum systems) provided by the Qiskit framework. In Section 4 we discuss the advantages and the drawbacks of our algorithm. Finally, in Section 5, we make our concluding remarks and point out some possible directions for future research.

1.1 Acknowledgement

This article was written as part of the Qu-Gov project, which was commissioned by the German Federal Ministry of Finance. The authors want to extend their gratitude to Kim Nguyen, Manfred Paeschke, Oliver Muth, Andreas Wilke, and Yvonne Ripke for their continuous encouragement and support.

2 The Algorithm

In Section 2.1 we describe the setting for our algorithm and the problem at hand, and in Section 2.2 we describe the algorithm we developed to tackle the problem.

2.1 The Setting

We let $n$ be a positive integer and, for $\theta\in\mathbb{R}^{m}$, consider the unitary

$$U(\theta)=C_{m+1}R_{m}(\theta_{m})C_{m}\cdots R_{2}(\theta_{2})C_{2}R_{1}(\theta_{1})C_{1}\in\mathbb{C}^{2^{n}\times 2^{n}},$$

where $m$ is a non-negative integer, $C_{1},\dots,C_{m+1}$ are unitaries given by $n$-qubit quantum circuits and, for all $j\in\{1,\dots,m\}$, the unitary $R_{j}(\theta_{j})$ is a rotation of the form

$$R_{j}(\theta_{j})=\exp\left(-i\frac{\theta_{j}}{2}G_{j}\right)\in\mathbb{C}^{2^{n}\times 2^{n}}$$

for some Hermitian $G_{j}\in\mathbb{C}^{2^{n}\times 2^{n}}$ whose set of eigenvalues is $\{-1,1\}$. For example, each $G_{j}$ could be a tensor product $P_{1}\otimes\cdots\otimes P_{n}\neq I^{\otimes n}$ of Pauli matrices $P_{1},\dots,P_{n}\in\{I,X,Y,Z\}$. Moreover, we consider an observable given by a Hermitian matrix $\mathcal{M}\in\mathbb{C}^{2^{n}\times 2^{n}}$. Letting $|\psi(\theta)\rangle:=U(\theta)|0\rangle^{\otimes n}$, we consider the expected value

$$f(\theta):=\langle\psi(\theta)|\mathcal{M}|\psi(\theta)\rangle,$$

which defines a function $f\colon\mathbb{R}^{m}\to\mathbb{R}$. The aim is now to minimize $f$ using gradient descent. Gradients of $f$ are typically calculated using the so-called parameter-shift rules [MNKF18], [SBG+19], [MBK21], which reduce computation of the gradient to $2m$ evaluations of $f$. However, evaluation of $f$ is subject to both measurement shot noise and hardware quantum noise, since it involves the execution of quantum circuits on a quantum device. As a result, gradients obtained in this way are themselves noisy, which is detrimental to performance of the gradient descent algorithm.
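To make the setting concrete, the following is a minimal noise-free sketch of the objective $f$ and of the parameter-shift gradient, written with NumPy as a dense state-vector simulation. The two-qubit circuit, the choice of generators, the use of a CNOT for every $C_{j}$, and the observable are illustrative assumptions and are not taken from the article.

```python
import numpy as np

# Pauli matrices; each generator G_j below is Hermitian with eigenvalues -1 and +1.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Hypothetical example circuit (an assumption for illustration): n = 2, m = 3.
GENERATORS = [np.kron(X, I2), np.kron(I2, Y), np.kron(Z, Z)]
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)   # used as every C_j
OBSERVABLE = np.kron(Z, I2) + np.kron(I2, Z)     # Hermitian observable M


def rotation(angle, generator):
    """exp(-i * angle / 2 * G); the closed form holds because G @ G = I."""
    dim = generator.shape[0]
    return np.cos(angle / 2) * np.eye(dim) - 1j * np.sin(angle / 2) * generator


def f(theta):
    """Noise-free expected value <psi(theta)| M |psi(theta)>."""
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0                      # |0>^{(x) n}
    state = CNOT @ state                # C_1
    for angle, generator in zip(theta, GENERATORS):
        state = CNOT @ (rotation(angle, generator) @ state)  # R_j, then C_{j+1}
    return float(np.real(np.vdot(state, OBSERVABLE @ state)))


def parameter_shift_gradient(func, theta):
    """Gradient from 2m evaluations of func at theta +/- (pi/2) e_j."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros(len(theta))
    for j in range(len(theta)):
        shift = np.zeros(len(theta))
        shift[j] = np.pi / 2
        grad[j] = (func(theta + shift) - func(theta - shift)) / 2
    return grad
```

On a quantum device, each call to `f` would be replaced by a noisy estimate, so the gradient returned by `parameter_shift_gradient` inherits that noise.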

2.2 Denoised Gradient Descent Algorithm

As shown in [SSM21], we can write

$$f(\theta)=\sum_{\omega\in\{-1,0,1\}^{m}}c_{\omega}e^{i\omega^{t}\theta}\text{ for all }\theta\in\mathbb{R}^{m},$$

where $c_{\omega}\in\mathbb{C}$ and $c_{\omega}=\overline{c_{-\omega}}$ for all $\omega\in\{-1,0,1\}^{m}$. We denote the set of all functions of this form as $H$ and give $H$ the structure of a reproducing kernel Hilbert space with reproducing kernel $K$ by defining

$$\langle g_{1},g_{2}\rangle_{H}=\int_{[-\pi,\pi]^{m}}g_{1}(z)g_{2}(z)\,\mathrm{d}z,\ g_{1},g_{2}\in H,$$

and

$$K(x,z)=\frac{1}{(2\pi)^{m}}\prod_{j=1}^{m}(1+2\cos(x_{j}-z_{j})),\ x,z\in\mathbb{R}^{m}.$$

For the sake of numerical stability we will consider $\tilde{K}$ instead of $K$ in our algorithms, where the former is given by

$$\tilde{K}(x,z)=\prod_{j=1}^{m}\frac{1+2\cos(x_{j}-z_{j})}{3},\ x,z\in\mathbb{R}^{m}.$$

A more rigorous treatment of the above can be found in [SEKR23].
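As an illustration, the normalized kernel $\tilde{K}$ can be evaluated with a few lines of NumPy. The function names `kernel_tilde` and `gram_matrix` are hypothetical and chosen only for this sketch. Note that $K(x,x)=(3/(2\pi))^{m}$ decays exponentially in $m$, whereas $\tilde{K}(x,x)=1$ for every $m$, which is one way to see why the normalized version is numerically more convenient.

```python
import numpy as np


def kernel_tilde(x, z):
    """Normalized kernel: product over j of (1 + 2*cos(x_j - z_j)) / 3."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return float(np.prod((1.0 + 2.0 * np.cos(x - z)) / 3.0))


def gram_matrix(points):
    """Matrix with entries kernel_tilde(p_k, p_l) for the rows p_k of `points`."""
    points = np.asarray(points, dtype=float)        # shape (D, m)
    diff = points[:, None, :] - points[None, :, :]  # shape (D, D, m)
    return np.prod((1.0 + 2.0 * np.cos(diff)) / 3.0, axis=-1)
```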

We will now describe the algorithm. As is the case for the gradient descent algorithm, we start out with an initial point $\theta_{0}\in\mathbb{R}^{m}$, a learning rate $\alpha>0$, and a number of steps $T\in\mathbb{Z}_{\geq 1}$. In addition, we choose a regularization hyperparameter $\lambda>0$ and, optionally, an integer $\ell\geq 1$ for bounding the size of the linear systems of equations we will have to solve. In practice we may, of course, change $\alpha$, $\lambda$, and $\ell$ from iteration to iteration and combine our algorithm with, for example, normalized gradient descent [HLSS15], Nesterov's Accelerated Gradient method [Nes83], or the ADAM optimizer [KB15], but we will not discuss these options here for the sake of simplicity (for a treatment of these in the context of variational quantum algorithms, see [SYRY21]).

Now assume that $t\in\mathbb{Z}$, $1\leq t\leq T$, and that we have already carried out $t-1$ steps, which yielded points $\theta_{0},\dots,\theta_{t-1}\in\mathbb{R}^{m}$ in parameter space, and, for $s\in\{0,1,\dots,t-2\}$, (noisy) samples

$$g^{(s)}_{j,\nu}\approx f\left(\theta_{s}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right),\ j\in\{1,\dots,m\},\ \nu\in\{-1,1\},$$

obtained from evaluating $f\left(\theta_{s}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right)$ on a quantum device (note that $\{0,1,\dots,t-2\}$ is the empty set when $t=1$, i.e., in this case we merely start out with $\theta_{0}$). Here, $\mathbf{e}_{j}\in\mathbb{R}^{m}$ denotes the $j^{\text{th}}$ canonical basis vector. We now start the $t^{\text{th}}$ step of the algorithm by first obtaining (noisy) samples $g^{(t-1)}_{j,\nu}\approx f\left(\theta_{t-1}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right)$, $j\in\{1,\dots,m\}$, $\nu\in\{-1,1\}$, through evaluation on a quantum device. Instead of using these samples to get a noisy estimate for the gradient $\nabla f(\theta_{t-1})$ by way of the parameter-shift rules, we compute a classical approximation $\tilde{f}_{t}$ of $f$ and then obtain $\theta_{t}$ by carrying out a gradient descent step with respect to $\tilde{f}_{t}$. Unsurprisingly, the quality of the approximation will in general not be good on the entire parameter space $\mathbb{R}^{m}$; otherwise we could forgo the need for a quantum device and simply work with the classical approximation instead. However, we are guaranteed to obtain a good estimate for the gradient of $f$ at the point $\theta_{t-1}$, see Remark 1.

In order to construct the approximation $\tilde{f}_{t}$, we will make use of the (noisy) samples of $f$ we have obtained so far. For ease of notation, we denote the collection of points $\left(\theta_{s}+\nu\frac{\pi}{2}\mathbf{e}_{j}\right)_{s,j,\nu}$ as $p^{(t)}_{1},\dots,p^{(t)}_{D_{t}}$ and the corresponding noisy samples $\left(g^{(s)}_{j,\nu}\right)_{s,j,\nu}$ as $v^{(t)}_{1},\dots,v^{(t)}_{D_{t}}$ (the order in which we list the points does not have an impact on the function $\tilde{f}_{t}$ we are constructing). Here, $j\in\{1,\dots,m\}$, $\nu\in\{-1,1\}$, and $s$ ranges from $0$ to $t-1$ if the optional input $\ell$ was not provided. If the optional input $\ell$ was provided, we will discard some of the older samples in order to ensure that it does not become infeasible to compute the approximation $\tilde{f}_{t}$. In this case, $s$ ranges from $\max\{0,t-\ell\}$ to $t-1$. We now consider the linear system of equations

$$\left(\left(\tilde{K}(p^{(t)}_{k},p^{(t)}_{l})\right)_{1\leq k,l\leq D_{t}}+\lambda I_{D_{t}\times D_{t}}\right)\cdot\eta=\begin{pmatrix}v^{(t)}_{1}\\ \vdots\\ v^{(t)}_{D_{t}}\end{pmatrix},\text{ where }\eta\in\mathbb{R}^{D_{t}},$$

where $I_{D_{t}\times D_{t}}$ denotes the $D_{t}\times D_{t}$ identity matrix. Since $\tilde{K}$ only differs from the symmetric positive semidefinite kernel $K$ by multiplication with a positive constant, the matrix appearing in this linear system of equations is strictly positive definite (recall that $\lambda>0$) and hence invertible. This implies that there exists a uniquely determined solution $\eta^{(t)}\in\mathbb{R}^{D_{t}}$. We now define $\tilde{f}_{t}\colon\mathbb{R}^{m}\to\mathbb{R}$ by

$$\tilde{f}_{t}(\theta)=\sum_{k=1}^{D_{t}}\eta^{(t)}_{k}\tilde{K}(p^{(t)}_{k},\theta)\text{ for all }\theta\in\mathbb{R}^{m}.$$
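The construction of $\tilde{f}_{t}$ amounts to kernel ridge regression with the kernel $\tilde{K}$. The following is a minimal sketch of this step, assuming NumPy and a generic dense solver; the names `gram` and `fit_surrogate` are hypothetical. Since the system matrix is symmetric positive definite, a Cholesky-based solver would be a natural alternative to `np.linalg.solve`.

```python
import numpy as np


def gram(A, B):
    """Matrix of normalized kernel values K~(a, b) for rows a of A and rows b of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.prod((1.0 + 2.0 * np.cos(diff)) / 3.0, axis=-1)


def fit_surrogate(points, values, lam):
    """Solve (K~ + lam * I) eta = v and return the surrogate function and eta."""
    points = np.asarray(points, dtype=float)   # shape (D_t, m): the p_k
    values = np.asarray(values, dtype=float)   # shape (D_t,):   the v_k
    system = gram(points, points) + lam * np.eye(len(values))
    eta = np.linalg.solve(system, values)

    def surrogate(theta):
        # Evaluates sum_k eta_k * K~(p_k, theta); purely classical.
        theta = np.asarray(theta, dtype=float)[None, :]
        return float(gram(theta, points)[0] @ eta)

    return surrogate, eta
```

For small $\lambda$ and noise-free samples the surrogate nearly interpolates the sample values; larger $\lambda$ trades this fidelity for robustness to noise.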

Note that $\tilde{f}_{t}$ can be evaluated on a classical device. Since $\tilde{f}_{t}\in H$, the parameter-shift rules apply (see Lemma 10 in [SEKR23]), and hence we get, for all $j\in\{1,\dots,m\}$:

$$\partial_{j}\tilde{f}_{t}(\theta_{t-1})=\frac{\tilde{f}_{t}\left(\theta_{t-1}+\frac{\pi}{2}\mathbf{e}_{j}\right)-\tilde{f}_{t}\left(\theta_{t-1}-\frac{\pi}{2}\mathbf{e}_{j}\right)}{2},$$

which allows us to compute $\nabla\tilde{f}_{t}(\theta_{t-1})$. Alternatively, $\nabla\tilde{f}_{t}(\theta_{t-1})$ can be computed using the explicit expression for $\tilde{K}$ given above. We now obtain $\theta_{t}$ as

$$\theta_{t}=\theta_{t-1}-\alpha\cdot\nabla\tilde{f}_{t}(\theta_{t-1}).$$
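Both ways of computing the surrogate gradient can be sketched as follows, assuming the sample points $p_{k}$ are stored as rows of an array `points` and the coefficients as a vector `eta` (hypothetical names). The closed form uses $\partial_{\theta_{j}}\cos(p_{j}-\theta_{j})=\sin(p_{j}-\theta_{j})$, so the $j^{\text{th}}$ factor of the product is replaced by $2\sin(p_{j}-\theta_{j})/3$.

```python
import numpy as np


def surrogate_value(points, eta, theta):
    """sum_k eta_k * K~(p_k, theta) for sample points p_k (rows of `points`)."""
    return float(np.prod((1.0 + 2.0 * np.cos(points - theta)) / 3.0, axis=1) @ eta)


def gradient_by_parameter_shift(points, eta, theta):
    """Gradient of the surrogate via shifts by +/- pi/2, evaluated classically."""
    m = len(theta)
    grad = np.zeros(m)
    for j in range(m):
        shift = np.zeros(m)
        shift[j] = np.pi / 2
        grad[j] = (surrogate_value(points, eta, theta + shift)
                   - surrogate_value(points, eta, theta - shift)) / 2
    return grad


def gradient_by_closed_form(points, eta, theta):
    """Gradient of the surrogate from the explicit product form of K~."""
    factors = (1.0 + 2.0 * np.cos(points - theta)) / 3.0   # shape (D, m)
    derivatives = 2.0 * np.sin(points - theta) / 3.0        # d/dtheta_j of factor j
    grad = np.zeros(len(theta))
    for j in range(len(theta)):
        terms = factors.copy()
        terms[:, j] = derivatives[:, j]
        grad[j] = np.prod(terms, axis=1) @ eta
    return grad


def descent_step(points, eta, theta, alpha):
    """One update theta_t = theta_{t-1} - alpha * grad of the surrogate."""
    return theta - alpha * gradient_by_closed_form(points, eta, theta)
```

The two gradient routines should agree up to floating-point error, since the parameter-shift rule is exact for functions in $H$.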

The use of the regularization parameter $\lambda$ also has an impact on the length of the gradient vector $\nabla\tilde{f}_{t}(\theta_{t-1})$ and we are, arguably, more interested in denoising the direction of the estimate for the gradient vector of $f$. Because of this, we may optionally want to force the step length to be the same as it would be when using the noisy estimate for the gradient of $f$. One way of doing this would be to use the alternative descent step $\theta_{t}=\theta_{t-1}-\alpha_{t}\cdot\nabla\tilde{f}_{t}(\theta_{t-1})$, where

$$\alpha_{t}=\frac{\left\|\frac{1}{2}\begin{pmatrix}g^{(t-1)}_{1,1}-g^{(t-1)}_{1,-1}\\ \vdots\\ g^{(t-1)}_{m,1}-g^{(t-1)}_{m,-1}\end{pmatrix}\right\|+\epsilon}{\left\|\nabla\tilde{f}_{t}(\theta_{t-1})\right\|+\epsilon}\cdot\alpha.$$

Here, $\|\cdot\|$ denotes the Euclidean norm and we introduced a small constant $\epsilon>0$ for the sake of numerical stability and in order to avoid division by $0$. This concludes our description of the algorithm. Note that the computational overhead of our algorithm is entirely classical; our algorithm needs exactly as many circuit evaluations as gradient descent.
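The rescaled learning rate $\alpha_{t}$ is a direct transcription of the formula above. In this sketch, `g_plus` and `g_minus` are assumed to hold the samples $g^{(t-1)}_{j,1}$ and $g^{(t-1)}_{j,-1}$ for $j=1,\dots,m$; the function name and the default value of `eps` are illustrative choices.

```python
import numpy as np


def rescaled_learning_rate(g_plus, g_minus, surrogate_grad, alpha, eps=1e-8):
    """alpha_t = (||noisy gradient|| + eps) / (||surrogate gradient|| + eps) * alpha."""
    noisy_grad = 0.5 * (np.asarray(g_plus, dtype=float) - np.asarray(g_minus, dtype=float))
    numerator = np.linalg.norm(noisy_grad) + eps
    denominator = np.linalg.norm(surrogate_grad) + eps
    return numerator / denominator * alpha
```

With this choice, the step $\alpha_{t}\cdot\nabla\tilde{f}_{t}(\theta_{t-1})$ has approximately the same length as $\alpha$ times the noisy parameter-shift gradient, while its direction comes from the surrogate.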

The intuition behind the algorithm is explained in Remark 1. A compact description of our algorithm in pseudocode can be found in Algorithm 1.
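The steps described above can be combined into a short end-to-end sketch. This is not the authors' implementation: it assumes NumPy, omits the optional step-length rescaling, and expects a callable `noisy_f` (a hypothetical stand-in for evaluation on a quantum device) that returns a noisy estimate of $f$ at a given point.

```python
import numpy as np


def gram(A, B):
    """Matrix of normalized kernel values K~(a, b) for rows a of A and rows b of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.prod((1.0 + 2.0 * np.cos(diff)) / 3.0, axis=-1)


def surrogate_gradient(points, eta, theta):
    """Parameter-shift gradient of the surrogate at theta (classical evaluation)."""
    shifts = (np.pi / 2) * np.eye(len(theta))
    plus = gram(theta + shifts, points) @ eta
    minus = gram(theta - shifts, points) @ eta
    return (plus - minus) / 2


def denoised_gradient_descent(noisy_f, theta0, T, alpha, lam, ell=None):
    theta = np.asarray(theta0, dtype=float)
    m = len(theta)
    ell = T if ell is None else ell          # default: keep all past iterations
    shifts = (np.pi / 2) * np.vstack([np.eye(m), -np.eye(m)])
    history = []                             # one (points, samples) pair per iteration
    for t in range(1, T + 1):
        pts = theta + shifts                 # theta_{t-1} +/- (pi/2) e_j
        vals = np.array([noisy_f(p) for p in pts])   # 2m evaluations of f
        history.append((pts, vals))
        history = history[-ell:]             # s = max(0, t - ell), ..., t - 1
        P = np.vstack([h[0] for h in history])
        v = np.concatenate([h[1] for h in history])
        eta = np.linalg.solve(gram(P, P) + lam * np.eye(len(v)), v)
        theta = theta - alpha * surrogate_gradient(P, eta, theta)
    return theta
```

As a toy usage example (an assumption for illustration, not an experiment from the article), one could take `noisy_f = lambda th: -np.mean(np.cos(th)) + 0.05 * rng.normal()` with `rng = np.random.default_rng(0)`; the noise-free part of this function has frequencies in $\{-1,0,1\}^{m}$ and therefore lies in $H$.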

Hyperparameters: learning rate $\alpha > 0$; regularization $\lambda > 0$; optional: bound $\ell \in \mathbb{Z}_{\geq 1}$ on the number of iterations to consider when constructing the approximation; optional: $\epsilon > 0$ for numerical stability when rescaling the gradient
Input: initial point $\theta_0 \in \mathbb{R}^m$, number of steps $T \in \mathbb{Z}_{\geq 1}$, function $f\colon \mathbb{R}^m \to \mathbb{R}$, $f(\theta) = \langle 0^n | U^{\dagger}(\theta) \mathcal{M} U(\theta) | 0^n \rangle$ (see Section 2.1)
Output: point $\theta_T \in \mathbb{R}^m$ in parameter space

if optional hyperparameter $\ell$ not provided then
    Set $\ell := T$
end if
for $t = 1, \dots, T$ do
    for $j = 1, \dots, m$ do
        Obtain (noisy) samples
        $$g^{(t-1)}_{j,1} \approx f\left(\theta_{t-1} + \frac{\pi}{2}\mathbf{e}_j\right), \qquad g^{(t-1)}_{j,-1} \approx f\left(\theta_{t-1} - \frac{\pi}{2}\mathbf{e}_j\right),$$
        by evaluating $f$ at the respective points using a quantum device
    end for
    Assemble the collections
    $$\left(\theta_s + \nu \frac{\pi}{2}\mathbf{e}_j\right)_{s,j,\nu}, \qquad \left(g^{(s)}_{j,\nu}\right)_{s,j,\nu}$$
    into tuples $\left(p^{(t)}_1, \dots, p^{(t)}_{D_t}\right)$ and $\left(v^{(t)}_1, \dots, v^{(t)}_{D_t}\right)$ respectively, in arbitrary (but matching) order, where $j \in \{1, \dots, m\}$, $\nu \in \{-1, 1\}$, and $s \in \{\max\{0, t-\ell\}, \dots, t-1\}$
    Find the uniquely determined $\eta^{(t)} \in \mathbb{R}^{D_t}$ solving the linear system of equations
    $$\left(\left(\tilde{K}(p^{(t)}_k, p^{(t)}_l)\right)_{1 \leq k,l \leq D_t} + \lambda I_{D_t \times D_t}\right) \cdot \eta = \begin{pmatrix} v^{(t)}_1 \\ \vdots \\ v^{(t)}_{D_t} \end{pmatrix}, \text{ where } \eta \in \mathbb{R}^{D_t}$$
    Set $\tilde{f}_t(\theta) = \sum_{k=1}^{D_t} \eta^{(t)}_k \tilde{K}(p^{(t)}_k, \theta)$
    Compute the gradient
    $$\nabla \tilde{f}_t(\theta_{t-1}) = \left(\frac{\tilde{f}_t\left(\theta_{t-1} + \frac{\pi}{2}\mathbf{e}_j\right) - \tilde{f}_t\left(\theta_{t-1} - \frac{\pi}{2}\mathbf{e}_j\right)}{2}\right)_{j=1,\dots,m}$$
    (alternatively, $\nabla \tilde{f}_t(\theta_{t-1})$ can be computed using the explicit expression for $\tilde{K}$)
    if optional hyperparameter $\epsilon$ not provided then
        Set $\alpha_t := \alpha$ (in this case we do not rescale the gradient)
    else
        Set $\alpha_t := \frac{\left\| \frac{1}{2} \begin{pmatrix} g^{(t-1)}_{1,1} - g^{(t-1)}_{1,-1} \\ \vdots \\ g^{(t-1)}_{m,1} - g^{(t-1)}_{m,-1} \end{pmatrix} \right\| + \epsilon}{\left\| \nabla \tilde{f}_t(\theta_{t-1}) \right\| + \epsilon} \cdot \alpha$
    end if
    Set $\theta_t := \theta_{t-1} - \alpha_t \cdot \nabla \tilde{f}_t(\theta_{t-1})$
end for
return $\theta_T$

Algorithm 1 Denoised Gradient Descent
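As an illustration, Algorithm 1 might be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `evaluate` is a hypothetical callable standing in for the (noisy) evaluation of $f$ on a quantum device, and `product_cosine_kernel` is an assumed stand-in for $\tilde{K}$, whose actual definition is given in an earlier section and should be substituted where it differs.

```python
import numpy as np

def product_cosine_kernel(x, y):
    # ASSUMPTION: illustrative kernel, used here in place of the paper's K-tilde.
    return np.prod(1.0 + np.cos(x - y), axis=-1)

def denoised_gradient_descent(evaluate, theta0, T, alpha, lam, kernel, ell=None, eps=None):
    theta = np.asarray(theta0, dtype=float)
    m = theta.size
    ell = T if ell is None else ell
    shift = 0.5 * np.pi * np.eye(m)            # row j equals (pi/2) * e_j
    history = []                               # (points, values) per iteration
    for _ in range(T):
        # Same 2m circuit evaluations as one parameter-shift gradient.
        pts = np.concatenate([theta + shift, theta - shift])
        vals = np.array([evaluate(p) for p in pts])
        history.append((pts, vals))
        history = history[-ell:]               # keep the last ell iterations only
        P = np.concatenate([h[0] for h in history])
        v = np.concatenate([h[1] for h in history])
        # Regularized linear system (K + lam * I) eta = v.
        gram = kernel(P[:, None, :], P[None, :, :])
        eta = np.linalg.solve(gram + lam * np.eye(len(v)), v)
        f_tilde = lambda x: eta @ kernel(P, x)
        # Parameter-shift rule applied to the classical approximation.
        grad = np.array([(f_tilde(theta + s) - f_tilde(theta - s)) / 2 for s in shift])
        if eps is None:
            alpha_t = alpha                    # no rescaling
        else:
            noisy = 0.5 * (vals[:m] - vals[m:])
            alpha_t = (np.linalg.norm(noisy) + eps) / (np.linalg.norm(grad) + eps) * alpha
        theta = theta - alpha_t * grad
    return theta
```

All quantum-device access is confined to the calls to `evaluate`; everything else is classical linear algebra on a matrix of size $D_t \times D_t$ with $D_t \leq 2m\ell$.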
Remark 1.

In order to understand the significance of the function $\tilde{f}_t$, it is instructive to consider the hypothetical scenario where $\lambda$ is replaced by $0$ and where all samples we obtained are completely noiseless, i.e., coincide with the exact values of $f$ at the respective points. In this case, the above linear system of equations reduces to

$$\left(\tilde{K}(p^{(t)}_k, p^{(t)}_l)\right)_{1 \leq k,l \leq D_t} \cdot \eta = \begin{pmatrix} f(p^{(t)}_1) \\ \vdots \\ f(p^{(t)}_{D_t}) \end{pmatrix}, \text{ where } \eta \in \mathbb{R}^{D_t}.$$

As was shown in the appendix of [SEKR23] (see also Implementation Remark 6 in this reference), the linear system of equations still has a solution $\eta^{(t)} \in \mathbb{R}^{D_t}$ in this setting, and the function $\tilde{f}_t$, defined as above, agrees with $f$ on $\{p^{(t)}_1, \dots, p^{(t)}_{D_t}\}$. In particular, since the parameter-shift rules apply to both $f$ and $\tilde{f}_t$ (both are elements of $H$), we have for all $j \in \{1, \dots, m\}$:

$$\partial_j f(\theta_{t-1}) = \frac{f\left(\theta_{t-1} + \frac{\pi}{2}\mathbf{e}_j\right) - f\left(\theta_{t-1} - \frac{\pi}{2}\mathbf{e}_j\right)}{2} = \frac{\tilde{f}_t\left(\theta_{t-1} + \frac{\pi}{2}\mathbf{e}_j\right) - \tilde{f}_t\left(\theta_{t-1} - \frac{\pi}{2}\mathbf{e}_j\right)}{2} = \partial_j \tilde{f}_t(\theta_{t-1}).$$

It follows that $\nabla f(\theta_{t-1}) = \nabla \tilde{f}_t(\theta_{t-1})$. So, in the noiseless case without regularization hyperparameter, we exactly recover the gradient of $f$ at $\theta_{t-1}$.
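To see why a parameter-shift identity of this form can be exact, consider the following one-dimensional computation. It assumes (this is an assumption made here for illustration, in line with the Pauli rotations used in the circuits) that a function restricted to its $j$-th coordinate has the form $g(x) = a + b\cos x + c\sin x$ with real constants $a$, $b$, $c$:

$$\frac{g\left(x + \frac{\pi}{2}\right) - g\left(x - \frac{\pi}{2}\right)}{2} = \frac{(a - b\sin x + c\cos x) - (a + b\sin x - c\cos x)}{2} = -b\sin x + c\cos x = g'(x).$$

Under this assumption, the two function values at $x \pm \frac{\pi}{2}$ determine the derivative at $x$ exactly, without any finite-difference error.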

In the noisy case, we need to introduce a small $\lambda > 0$ in order to ensure that the linear system of equations still has a solution. In this case, the function $\tilde{f}_t$ corresponds to a solution to the optimization problem underlying kernel ridge regression [Vov13] with kernel $\tilde{K}$, regularization hyperparameter $\lambda$, and data $\left((p_k^{(t)}, v_k^{(t)})\right)_{k=1,\dots,D_t}$. One can hope that, in the spirit of Tikhonov regularization [HK70], the regularization hyperparameter $\lambda$ helps mitigate the effect that noise (stemming from the evaluation of $f$ on a quantum device) has on the solution to the above linear system of equations. Moreover, this approach is very natural, since the true function $f$ is known to be contained in $H$, the reproducing kernel Hilbert space associated to the kernel $K$, which only differs from $\tilde{K}$ by multiplication with a positive constant. However, the main benefit of our method stems from the fact that (when $\ell \geq 2$ or the optional input $\ell$ was not provided) samples from past iterations are used to improve the quality of the classical approximation $\tilde{f}_t$.
While one would not expect this to be beneficial in the noiseless case (we recover the exact gradient using the samples from the current iteration, see above), it is plausible that, in the noisy case, the increased number of samples involved in the approximation process will reduce the effect of noise on the gradient estimate provided by our approximation. We give numerical evidence for this in Section 3; a thorough theoretical analysis is left for future work.
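For reference, the kernel ridge regression problem mentioned in this remark can be written in its textbook form as follows. Here $\tilde{H}$ and $\|\cdot\|_{\tilde{H}}$ are notation introduced only for this illustration, denoting the reproducing kernel Hilbert space associated to $\tilde{K}$ and its norm; the normalization of $\lambda$ used in [Vov13] may differ.

$$\tilde{f}_t \in \operatorname*{arg\,min}_{h \in \tilde{H}} \; \sum_{k=1}^{D_t} \left(h(p^{(t)}_k) - v^{(t)}_k\right)^2 + \lambda \|h\|_{\tilde{H}}^2.$$

By the representer theorem, a minimizer can be taken of the form $h = \sum_{k=1}^{D_t} \eta_k \tilde{K}(p^{(t)}_k, \cdot)$. Writing $\mathbf{K} = \left(\tilde{K}(p^{(t)}_k, p^{(t)}_l)\right)_{k,l}$ and $v = (v^{(t)}_1, \dots, v^{(t)}_{D_t})$, the objective becomes $\|\mathbf{K}\eta - v\|^2 + \lambda \eta^{T}\mathbf{K}\eta$, whose gradient $2\mathbf{K}\left((\mathbf{K} + \lambda I)\eta - v\right)$ vanishes for the solution of $(\mathbf{K} + \lambda I)\eta = v$, which is the linear system solved in Algorithm 1.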

3 Experiments

In this section we describe our experiments with Algorithm 1. In Section 3.1 we will randomly sample parametrized quantum circuits and points in parameter space and compare the quality of the noisy gradient to that of the denoised gradient computed by Algorithm 1. Here, quality is measured in terms of cosine similarity to the exact gradient vector, computed via statevector simulation. In Section 3.2 we randomly sample a parametrized quantum circuit and an initial point in parameter space and compare the descent of the objective function between many executions of noisy gradient descent and Algorithm 1 respectively.

In all our experiments, points in parameter space were randomly sampled from the uniform distribution on $[0, 2\pi)^m$, and we worked with the fixed observable $\mathcal{M} = Z^{\otimes n}$ and initial state $|0\rangle^{\otimes n}$. Informally speaking, we created random circuits by sandwiching random parametrized $n$-qubit Pauli rotations between random unitaries. More precisely, with the notation from Section 2.1, $G_1, \dots, G_m$ were randomly sampled (independently) from the uniform distribution on $\{I, X, Y, Z\}^{\otimes n} \setminus \{I^{\otimes n}\}$. In order to keep the circuit depth manageable, the unitaries $C_1, \dots, C_{m+1}$ were not sampled from the Haar measure on the unitary group $U(2^n)$.
Instead, they were randomly sampled (independently) as follows: For $j \in \{1, \dots, m+1\}$ we first chose a permutation $\tau_j \in S_n$ uniformly at random and subsequently (independently) sampled unitaries $U^{(j)}_1, \dots, U^{(j)}_{\lfloor n/2 \rfloor}$ from the Haar measure on $SU(4)$. The unitary $C_j$ was then obtained by applying $U^{(j)}_1, \dots, U^{(j)}_{\lfloor n/2 \rfloor}$ to the qubit pairs $(\tau_j(1), \tau_j(2)), \dots, (\tau_j(2\lfloor n/2 \rfloor - 1), \tau_j(2\lfloor n/2 \rfloor))$ respectively. Note that this is precisely how the individual layers in the quantum volume test [CBS+19] are sampled.
For a visual representation of the circuits featuring in our experiments, see Figure 1.
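The sampling procedure described above can be sketched in Python with NumPy. This is an illustrative sketch, not the authors' code: all function names are hypothetical, qubits are indexed from 0 rather than 1, and the Haar sampling uses the standard QR-based construction followed by a determinant normalization, which is one possible way to obtain Haar-distributed elements of $SU(4)$.

```python
import numpy as np

def sample_pauli_string(n, rng):
    """Uniform sample from {I,X,Y,Z}^n excluding the all-identity string."""
    while True:  # rejection sampling keeps the distribution uniform
        s = "".join(rng.choice(list("IXYZ"), size=n))
        if s != "I" * n:
            return s

def sample_haar_su4(rng):
    """Haar-random 4x4 unitary with determinant 1."""
    z = (rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    d = np.diag(r)
    q = q * (d / np.abs(d))                  # phase fix: Haar measure on U(4)
    return q / np.linalg.det(q) ** 0.25      # normalize the determinant to 1

def sample_layer(n, rng):
    """One layer C_j as a list of ((qubit_a, qubit_b), unitary) entries."""
    tau = rng.permutation(n)
    return [((int(tau[2 * i]), int(tau[2 * i + 1])), sample_haar_su4(rng))
            for i in range(n // 2)]

def sample_circuit_description(n, m, rng):
    """Pauli strings G_1..G_m and layers C_1..C_{m+1}."""
    paulis = [sample_pauli_string(n, rng) for _ in range(m)]
    layers = [sample_layer(n, rng) for _ in range(m + 1)]
    return paulis, layers
```

The returned description is backend-agnostic; turning it into an executable circuit depends on the quantum software framework in use.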

Figure 1: This figure shows the random circuits used in the experiments in Section 3. Here, $C_1, \dots, C_{m+1}$ are randomly sampled just like the individual layers in the quantum volume test [CBS+19]. Moreover, $G_1, \dots, G_m$ are randomly sampled non-identity Pauli strings and the measurement observable is always $\mathcal{M} = Z^{\otimes n}$. For a more rigorous description of these circuits, see the beginning of Section 3.

3.1 Alignment with Exact Gradient Vector

Here we describe the experiments we carried out to compare the respective alignment of the noisy gradient and the denoised gradient with the exact gradient vector. In all experiments we proceeded as follows: We decided on a number of samples $N$ and fixed $n$, $m$, the number of shots per circuit evaluation, and the learning rate $\alpha = 0.1$. For each combination of $\ell$ and quantum backend featuring in the experiment, we then repeated the following $N$ (number of samples) times:

  1. A circuit and a point $\theta_0 \in \mathbb{R}^m$ are sampled randomly as explained in the beginning of Section 3,

  2. Algorithm 1 is executed for $T = \ell$ steps (including gradient rescaling), and the last (denoised) gradient computed during the algorithm (at point $\theta_{\ell-1}$) is denoted as $w_1 \in \mathbb{R}^m$,

  3. The (noisy) gradient $w_2 \in \mathbb{R}^m$ is computed at point $\theta_{\ell-1}$ by way of the parameter-shift rules,

  4. Denoting the exact gradient at $\theta_{\ell-1}$, computed via statevector simulation, as $w \in \mathbb{R}^m$, the cosine similarities $x_j = \frac{w^{t} w_j}{\|w\| \|w_j\|}$ between $w_j$ and $w$, where $j \in \{1, 2\}$, are computed (the denominator is artificially bounded from below by a positive constant for the sake of numerical stability and in order to avoid division by zero),

  5. The point $(x_1, x_2) \in [-1, 1] \times [-1, 1]$ is plotted in a coordinate system.

We then end up with a scatter plot of $N$ points. The points below the diagonal $\{x = y\}$ correspond to outcomes where the denoised gradient obtained from Algorithm 1 was closer to the exact gradient than the noisy gradient (where closeness is measured in terms of cosine similarity; this is a sensible similarity measure, since the denoised gradient was rescaled to the length of the noisy gradient).
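The evaluation in steps 4 and 5 can be sketched in Python with NumPy as follows. The function names and the value of the lower bound `floor` are assumptions made for illustration; the paper does not state which positive constant was used to bound the denominator.

```python
import numpy as np

def cosine_similarity(w, w_j, floor=1e-12):
    """Cosine similarity with the denominator bounded from below by `floor`."""
    w = np.asarray(w, dtype=float)
    w_j = np.asarray(w_j, dtype=float)
    denom = max(np.linalg.norm(w) * np.linalg.norm(w_j), floor)
    return float(w @ w_j / denom)

def fraction_denoised_better(points):
    """Share of points (x1, x2) strictly below the diagonal, i.e. with x1 > x2,
    where x1 belongs to the denoised and x2 to the noisy gradient."""
    pts = np.asarray(points, dtype=float)
    return float(np.mean(pts[:, 0] > pts[:, 1]))
```

Points exactly on the diagonal are not counted as wins for the denoised gradient, matching the convention that only points below the diagonal indicate an improvement.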

Note that, for each of the $N$ samples, both circuit and point in parameter space are sampled randomly (independently), i.e., there is a large number of different circuits appearing in each experiment.

We carry out two such experiments: In Section 3.1.1 we focus on the effect of measurement shot noise, whereas in Section 3.1.2 we focus on (simulated) quantum hardware noise.

3.1.1 Measurement Shot Noise

For this experiment we use $N = 500$ samples and set $n = 8$, $m = 8$, $\lambda = 0.28$. Further, we set the number of measurement shots per circuit to $200$. Since we want to exclusively focus on measurement shot noise in this experiment, we choose the AerSimulator (without noise model) provided by the Qiskit framework as the quantum backend for this experiment. We then carry out the procedure outlined in the beginning of Section 3.1 with $\ell = 1, \dots, 6$, obtaining six scatter plots, see Figure 2.
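To get a feeling for the magnitude of measurement shot noise at a given shot count, the following Python sketch models the estimation of an expectation value of a $\pm 1$-valued observable such as $Z^{\otimes n}$. It is a simplified statistical model with hypothetical function names, intended only as an illustration; it does not reproduce how the AerSimulator works internally.

```python
import numpy as np

def shot_estimate(f_exact, shots, rng):
    """Estimate an expectation value f_exact in [-1, 1] from `shots` outcomes,
    each of which is +1 with probability (1 + f_exact) / 2 and -1 otherwise."""
    n_plus = rng.binomial(shots, 0.5 * (1.0 + f_exact))
    return (2.0 * n_plus - shots) / shots

def shot_std(f_exact, shots):
    """Standard deviation of the estimate returned by shot_estimate."""
    return np.sqrt((1.0 - f_exact**2) / shots)
```

In this model the standard deviation of a single function evaluation is at most $1/\sqrt{200} \approx 0.071$ for $200$ shots, compared with $0.01$ for $10000$ shots.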

For $\ell = 1, \dots, 6$, the denoised gradient obtained from Algorithm 1 outperforms the noisy gradient in 50.0%, 72.8%, 82.6%, 88.6%, 91.8%, 93.0% of the cases respectively.

[Figure 2 consists of six scatter plots: (a) $\ell = 1$, (b) $\ell = 2$, (c) $\ell = 3$, (d) $\ell = 4$, (e) $\ell = 5$, (f) $\ell = 6$.]
Figure 2: We investigated the effect that the choice of $\ell$ has on the ability of Algorithm 1 to mitigate measurement shot noise. To that end, for each $\ell$, we randomly sampled $N = 500$ circuits and points in parameter space and computed both a noisy and a denoised gradient (obtained from Algorithm 1). Subsequently, both of them were compared to the exact gradient (computed via statevector simulation): For each of the $N = 500$ samples we plot a point in the $x$-$y$-plane, whose $x$- and $y$-coordinate are the cosine similarity of the exact gradient to the denoised gradient and to the noisy gradient respectively. Accordingly, points below the diagonal (red) correspond to outcomes where the denoised gradient obtained from Algorithm 1 outperformed the noisy gradient; points on or above the diagonal (blue) correspond to outcomes where this was not the case. For more details, see Section 3.1.1.

3.1.2 Quantum Hardware Noise

For this experiment we use $N = 250$ samples and set $n = 5$, $m = 8$, $\lambda = 0.04$, and $\ell = 5$. In order to suppress the effect of measurement shot noise, we set the number of measurement shots per circuit to $10000$. We then carry out the procedure outlined in the beginning of Section 3.1 with the (simulated) quantum hardware backends FakeVigoV2, FakeNairobiV2, FakeCairoV2, FakeBrooklynV2, FakeWashingtonV2 provided by the Qiskit framework. In order to demonstrate that measurement shot noise plays a negligible role in this experiment, we also carry out the above-mentioned procedure with the AerSimulator (without noise model) provided by the Qiskit framework. We thus obtain a total of six scatter plots, see Figure 3.

As expected, for the AerSimulator (without noise model), all noisy and denoised gradients were very close to the exact gradient. For the (simulated) quantum hardware backends FakeVigoV2, FakeNairobiV2, FakeCairoV2, FakeBrooklynV2, FakeWashingtonV2, the denoised gradient obtained from Algorithm 1 outperforms the noisy gradient in 92.4%, 91.6%, 92.4%, 89.6%, 91.6% of the cases respectively.

[Figure 3 consists of six scatter plots: (a) FakeVigoV2, (b) FakeNairobiV2, (c) FakeCairoV2, (d) FakeBrooklynV2, (e) FakeWashingtonV2, (f) AerSimulator (without noise model).]
Figure 3: We investigated the ability of Algorithm 1 to mitigate the effect of quantum hardware noise. To that end, for each quantum backend, we randomly sampled $N = 250$ circuits and points in parameter space and computed both a noisy and a denoised gradient (obtained from Algorithm 1). Subsequently, both of them were compared to the exact gradient (computed via statevector simulation): For each of the $N = 250$ samples we plot a point in the $x$-$y$-plane, whose $x$- and $y$-coordinate are the cosine similarity of the exact gradient to the denoised gradient and to the noisy gradient respectively. Accordingly, points below the diagonal (red) correspond to outcomes where the denoised gradient obtained from Algorithm 1 outperformed the noisy gradient; points on or above the diagonal (blue) correspond to outcomes where this was not the case. The AerSimulator backend (without noise model) was included in order to demonstrate that the effect of measurement shot noise is negligible in this experiment. For clarity of visual presentation, we only show points for which both $x$- and $y$-coordinate are $\geq 0.6$ (all points which are not shown are red, i.e., correspond to outcomes where the denoised gradient outperformed the noisy gradient). For more details, see Section 3.1.2.

3.2 Descent of Objective Function

In order to keep the computational load manageable, we set $n = 4$, $m = 4$, $\ell = 5$, learning rate $\alpha = 0.4$, number of steps $T = 60$. We further set the number of measurement shots per circuit to $50$. We then randomly sample a single circuit and initial point $\theta_0 \in \mathbb{R}^m$ in parameter space as described in the beginning of Section 3. For each combination of regularization hyperparameter $\lambda$ and quantum backend featuring in this experiment, we then repeat the following $N = 100$ times:

  1. Execute Algorithm 1 with initial point $\theta_0$ (including gradient rescaling),

  2. Execute noisy gradient descent with initial point $\theta_0$ using the parameter-shift rules with the same number of steps ($T = 60$), the same learning rate ($\alpha = 0.4$), and the same number of measurement shots per circuit ($50$).

For both algorithms we thus obtain $N = 100$ sequences of points $(\theta_0, \dots, \theta_{60})$ in parameter space. For the sake of better comparability we then use statevector simulation to evaluate the exact values of $f$ (see Section 2.1) at these points, which, for both algorithms, yields $N = 100$ sequences of values $(f(\theta_0), \dots, f(\theta_{60}))$. It is important to point out that exact evaluations of the function $f$ were not employed during the execution of either algorithm, but rather after the executions of the algorithms in order to compare the results. I.e., exact evaluations of $f$ did not have an impact on the execution of either algorithm.

For both algorithms we then computed the component-wise average of the $N = 100$ sequences of values, giving precisely one sequence of values of length $T + 1 = 61$ for each algorithm. These two sequences can be interpreted as the respective average performance of Algorithm 1 and noisy gradient descent. For the sake of comparison we then computed a corresponding sequence of $61$ points using exact gradient descent (same initial point, number of steps, and learning rate as for noisy gradient descent, but the gradients were computed via the parameter-shift rules using statevector simulation).
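
The evaluation protocol for the noisy and exact baselines can be sketched as follows. This is a minimal illustration under several assumptions, not the code used in the experiment: the objective `f_exact` is a toy trigonometric function standing in for a circuit expectation value, `f_noisy` models noise crudely as additive Gaussian noise of scale $1/\sqrt{50}$, the two-term parameter-shift rule with shift $\pi/2$ is assumed to apply, and Algorithm 1 itself is not implemented. All function names are hypothetical.

```python
import numpy as np

M, T, ALPHA, SHOTS, N = 4, 60, 0.4, 50, 100  # m, steps, learning rate, shots, repetitions

def f_exact(theta):
    """Toy objective (assumption): unit-frequency trigonometric in each parameter."""
    return float(np.mean(np.cos(theta)))

def f_noisy(theta, rng):
    """Exact value plus Gaussian noise of scale 1/sqrt(SHOTS) (crude noise model)."""
    return f_exact(theta) + rng.normal(scale=1.0 / np.sqrt(SHOTS))

def parameter_shift_gradient(f, theta):
    """Gradient from 2*m evaluations: (f(theta + pi/2 e_j) - f(theta - pi/2 e_j)) / 2."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        shift = np.zeros_like(theta)
        shift[j] = np.pi / 2
        grad[j] = 0.5 * (f(theta + shift) - f(theta - shift))
    return grad

def gradient_descent(f, theta0, steps=T, lr=ALPHA):
    """Return the steps + 1 iterates theta_0, ..., theta_T."""
    path = [np.array(theta0, dtype=float)]
    for _ in range(steps):
        path.append(path[-1] - lr * parameter_shift_gradient(f, path[-1]))
    return path

rng = np.random.default_rng(0)
theta0 = rng.uniform(0.0, 2.0 * np.pi, size=M)

# N noisy runs; exact values of f are used only afterwards, to score the iterates.
noisy_curves = np.array([
    [f_exact(th) for th in gradient_descent(lambda th: f_noisy(th, rng), theta0)]
    for _ in range(N)
])
mean_curve = noisy_curves.mean(axis=0)  # component-wise average, length T + 1
std_curve = noisy_curves.std(axis=0)    # spread across the N runs
exact_curve = np.array([f_exact(th) for th in gradient_descent(f_exact, theta0)])

print(mean_curve[-1], exact_curve[-1])
```

Note that the noisy objective enters only through the gradient estimates, whereas the reported curves are built from exact evaluations at the visited points, mirroring the separation described above.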

This procedure was carried out for each of the $9$ combinations of regularization hyperparameters $\lambda = 4/\sqrt{50}$, $\lambda = 0.01/\sqrt{50}$, and $\lambda = 0.001/\sqrt{50}$ with (simulated) quantum hardware backends FakeVigoV2, FakeNairobiV2, FakeCairoV2. The results are visualized in Figure 4.
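
For orientation, the experimental grid and the numerical values of $\lambda$ can be enumerated as in the following short sketch. The backend names are used here purely as string labels; no quantum software is invoked, and the variable names are illustrative.

```python
from itertools import product
from math import sqrt

SHOTS = 50  # measurement shots per circuit

# The three regularization strengths, each scaled by 1/sqrt(SHOTS).
lambdas = [c / sqrt(SHOTS) for c in (4, 0.01, 0.001)]
backends = ["FakeVigoV2", "FakeNairobiV2", "FakeCairoV2"]

# Every (lambda, backend) pair receives its own batch of N = 100 runs.
grid = list(product(lambdas, backends))
for lam, backend in grid:
    print(f"{backend}: lambda = {lam:.6f}")
```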

The results shown in Figure 4 validate our algorithm. However, as expected, there is some indication that the optimal choice for the regularization hyperparameter $\lambda$ might be device-dependent. Unsurprisingly, the results seem to further indicate that – for optimal performance – $\lambda$ should be adjusted over the course of the algorithm (e.g., based on the length of the noisy gradient vector).

(a) FakeVigoV2, Algorithm 1 with $\lambda = 0.01/\sqrt{50}$ compared to noisy and exact gradient descent
(b) FakeNairobiV2, Algorithm 1 with $\lambda = 0.01/\sqrt{50}$ compared to noisy and exact gradient descent
(c) FakeCairoV2, Algorithm 1 with $\lambda = 0.01/\sqrt{50}$ compared to noisy and exact gradient descent
(d) FakeVigoV2, Algorithm 1 with different values for $\lambda$
(e) FakeNairobiV2, Algorithm 1 with different values for $\lambda$
(f) FakeCairoV2, Algorithm 1 with different values for $\lambda$
Figure 4: We executed Algorithm 1 on several (simulated) quantum hardware backends and using several values for the regularization hyperparameter $\lambda$. The circuit and the initial point in parameter space were sampled randomly. For the sake of comparison, we also executed exact gradient descent (using statevector simulation) and noisy gradient descent using the parameter-shift rules. Algorithm 1 was executed $N = 100$ times for each combination of $\lambda$ and quantum hardware backend. Noisy gradient descent was executed $N = 100$ times for each quantum hardware backend. The resulting families of $N = 100$ curves were averaged respectively; the standard deviation is indicated in some of the plots. For more details, see Section 3.2.

4 Discussion

In this section we will discuss both the advantages and the drawbacks of our algorithm compared to (noisy) gradient descent.

The obvious advantage is that, in many scenarios in the context of variational quantum algorithms, our denoised gradient descent algorithm is able to significantly accelerate the descent of the objective function and to improve the alignment of the estimated gradient vector with the exact gradient vector when compared to noisy gradient descent. Several scenarios where this is indeed the case were explored in Section 3. Moreover, the computational overhead of our algorithm is entirely classical – the number of circuit evaluations is exactly the same as when executing (noisy) gradient descent using the parameter-shift rules.

However, the denoised gradient descent algorithm comes with some caveats. The most obvious one is that, since our algorithm makes use of samples from past iterations, it is not well suited to variational quantum algorithms which take data as input, i.e., to those variational quantum algorithms whose ansatz has parametrized gates corresponding to both inputs and trainable parameters. (Our algorithm would still prove beneficial if one were to carry out several consecutive gradient descent steps with the same mini-batch of training data, but this is a niche application that we will not consider here.) Instead, our algorithm is well suited to variational quantum algorithms whose ansatz only contains parametrized gates corresponding to trainable parameters; the most prominent examples of the latter are variational quantum eigensolvers [PMS+14].

There are also caveats regarding the performance and feasibility of the algorithm. For example, if the number of trainable parameters $m$ or the number of iterations $\ell$ to consider when computing the approximation becomes too large, it might become infeasible to solve the linear system of equations appearing in Algorithm 1: In each iteration $t \geq \ell$, the latter will be a square linear system with $2m\ell$ equations in $2m\ell$ unknowns. Furthermore, while we expect it to be straightforward to establish a rigorous advantage in the case of measurement shot noise (under reasonable assumptions), one cannot expect our algorithm to offer a tangible advantage for all kinds of quantum hardware noise – a detailed analysis is left for future work. In this context it is also important to mention that the optimal choice of the hyperparameter $\lambda > 0$ depends not only on the number of measurement shots per circuit, but also on the quantum device on which the circuits are executed. As such, good heuristics for the choice of $\lambda$ will necessarily be device-dependent. Finding good heuristics for the choice of $\lambda$ is further complicated by the fact that – for optimal performance – $\lambda$ should be adjusted over the course of the algorithm (e.g., based on the length of the noisy gradient vector).
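
To give a feeling for the size of this linear system, the following sketch computes $2m\ell$ for the setting of Section 3.2 and solves a regularized system of that size with a dense solver. It is an illustration only: we assume the generic kernel ridge regression form $(K + \lambda I)c = y$, and both the Gaussian kernel and the random sample points are placeholders, not the kernel or the samples used in Algorithm 1. The function names are hypothetical.

```python
import numpy as np

def system_size(m, ell):
    """Unknowns (= equations) for t >= ell: 2*m samples per step over ell steps."""
    return 2 * m * ell

def solve_ridge_system(K, y, lam):
    """Solve (K + lam * I) c = y with a dense solver; cost grows like n**3."""
    n = len(y)
    return np.linalg.solve(K + lam * np.eye(n), y)

m, ell, shots = 4, 5, 50
n = system_size(m, ell)  # 2 * 4 * 5 = 40 for the setting of Section 3.2

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(n, m))  # placeholder sample points
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)                     # placeholder Gaussian kernel (assumption)
y = rng.normal(size=n)                          # placeholder noisy function values
lam = 0.01 / np.sqrt(shots)

c = solve_ridge_system(K, y, lam)
residual = np.linalg.norm((K + lam * np.eye(n)) @ c - y)

# The dense solve scales cubically, so growing m or ell quickly becomes costly.
for m_big, ell_big in [(4, 5), (50, 5), (500, 10)]:
    print(m_big, ell_big, system_size(m_big, ell_big))
```

Because the placeholder kernel matrix is positive semidefinite, adding $\lambda I$ with $\lambda > 0$ makes the system strictly positive definite and hence uniquely solvable.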

Finally, we mention that the advantage offered by our algorithm might disappear if the learning rate is chosen too large. This is because the quality of the (local) approximation computed by our algorithm declines if the points at which the objective function $f$ is sampled are spaced further apart.

When weighing the advantages and drawbacks outlined above, we believe that there are some scenarios with practical relevance where using our algorithm would prove advantageous.

5 Conclusion and Outlook

In this article we introduced the denoised gradient descent algorithm, which mitigates the effect of noise on gradient descent in variational quantum algorithms. We explored the capabilities of the algorithm experimentally and discussed its advantages and drawbacks. Potential topics for future work include, but are not limited to, the following:

  • deriving rigorous performance guarantees for the algorithm under suitable assumptions on the noise,

  • thoroughly analyzing our algorithm for different types of noise channels,

  • deriving good device-dependent heuristics for the choice of the hyperparameter $\lambda > 0$ (including heuristics for adjusting $\lambda$ over the course of the algorithm),

  • determining how large the learning rate can be chosen without the quality of the local approximation deteriorating to the point where using our algorithm is no longer advantageous,

  • investigating whether our algorithm can be used to reduce the number of measurement shots necessary to successfully carry out gradient descent in variational quantum algorithms, see for example [SAPM23]. The experimental results in Section 3 seem to indicate that this is possible – however, further studies are needed to verify this.

References

  • [AKH+23] Amira Abbas, Robbie King, Hsin-Yuan Huang, William J. Huggins, Ramis Movassagh, Dar Gilboa, and Jarrod Ryan McClean. On quantum backpropagation, information reuse, and cheating measurement collapse. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [BWP23] Joseph Bowles, David Wierichs, and Chae-Yeun Park. Backpropagation scaling in parameterised quantum circuits, 2023.
  • [CACC21] Piotr Czarnik, Andrew Arrasmith, Patrick J. Coles, and Lukasz Cincio. Error mitigation with Clifford quantum-circuit data. Quantum, 5:592, November 2021.
  • [CBS+19] Andrew W. Cross, Lev S. Bishop, Sarah Sheldon, Paul D. Nation, and Jay M. Gambetta. Validating quantum computers using randomized model circuits. Phys. Rev. A, 100:032328, Sep 2019.
  • [FFR+21] Enrico Fontana, Nathan Fitzpatrick, David Muñoz Ramo, Ross Duncan, and Ivan Rungger. Evaluating the noise resilience of variational quantum algorithms. Phys. Rev. A, 104:022403, Aug 2021.
  • [HK70] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
  • [HLSS15] Elad Hazan, Kfir Yehuda Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Neural Information Processing Systems, 2015.
  • [KB15] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  • [LB17] Ying Li and Simon C. Benjamin. Efficient variational quantum simulator incorporating active error minimization. Phys. Rev. X, 7:021050, Jun 2017.
  • [LWM+23] Junyu Liu, Frederik Wilde, Antonio Anna Mele, Liang Jiang, and Jens Eisert. Stochastic noise can be helpful for variational quantum algorithms, 2023.
  • [MBK21] Andrea Mari, Thomas R. Bromley, and Nathan Killoran. Estimating the gradient and higher-order derivatives on quantum hardware. Physical Review A, 103(1), Jan 2021.
  • [MNKF18] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii. Quantum circuit learning. Physical Review A, 98(3), Sep 2018.
  • [Nes83] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence $o(1/k^2)$. Doklady AN USSR, 269:543–547, 1983.
  • [PMS+14] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alán Aspuru-Guzik, and Jeremy L. O’Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5(1):4213, 2014.
  • [RSPC23] Matteo Robbiati, Alejandro Sopena, Andrea Papaluca, and Stefano Carrazza. Real-time error mitigation for variational optimization on quantum hardware, 2023.
  • [SAPM23] Giuseppe Scriva, Nikita Astrakhantsev, Sebastiano Pilati, and Guglielmo Mazzola. Challenges of variational quantum optimization with measurement shot noise, 2023.
  • [SBG+19] Maria Schuld, Ville Bergholm, Christian Gogolin, Josh Izaac, and Nathan Killoran. Evaluating analytic gradients on quantum hardware. Physical Review A, 99(3), Mar 2019.
  • [SEKR23] Lars Simon, Holger Eble, Hagen-Henrik Kowalski, and Manuel Radons. Interpolating parametrized quantum circuits using blackbox queries, 2023.
  • [SSM21] Maria Schuld, Ryan Sweke, and Johannes Jakob Meyer. Effect of data encoding on the expressive power of variational quantum-machine-learning models. Physical Review A, 103(3), Mar 2021.
  • [SYRY21] Y. Suzuki, H. Yano, R. Raymond, and N. Yamamoto. Normalized gradient descent for variational quantum algorithms. In 2021 IEEE International Conference on Quantum Computing and Engineering (QCE), pages 1–9, Los Alamitos, CA, USA, Oct 2021. IEEE Computer Society.
  • [TBG17] Kristan Temme, Sergey Bravyi, and Jay M. Gambetta. Error mitigation for short-depth quantum circuits. Phys. Rev. Lett., 119:180509, Nov 2017.
  • [UNP+21] Miroslav Urbanek, Benjamin Nachman, Vincent R. Pascuzzi, Andre He, Christian W. Bauer, and Wibe A. de Jong. Mitigating depolarizing noise on quantum computers with noise-estimation circuits. Phys. Rev. Lett., 127:270502, Dec 2021.
  • [vdBMT22] Ewout van den Berg, Zlatko K. Minev, and Kristan Temme. Model-free readout-error mitigation for quantum expectation values. Phys. Rev. A, 105:032620, Mar 2022.
  • [Vov13] Vladimir Vovk. Kernel Ridge Regression, pages 105–116. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.
  • [WFC+21] Samson Wang, Enrico Fontana, M. Cerezo, Kunal Sharma, Akira Sone, Lukasz Cincio, and Patrick J. Coles. Noise-induced barren plateaus in variational quantum algorithms. Nature Communications, 12(1):6961, 2021.
  • [WIWL22] David Wierichs, Josh Izaac, Cody Wang, and Cedric Yen-Yu Lin. General parameter-shift rules for quantum gradients. Quantum, 6:677, March 2022.