The Quantum Esscher Transform

Yixian Qiu, Centre for Quantum Technologies, National University of Singapore
Kelvin Koor, Centre for Quantum Technologies, National University of Singapore
Patrick Rebentrost, Centre for Quantum Technologies, National University of Singapore; Department of Computer Science, School of Computing, National University of Singapore

Abstract

The Esscher Transform is a tool of broad utility in various domains of applied probability. It provides the solution to a constrained minimum relative entropy optimization problem. In this work, we study the generalization of the Esscher Transform to the quantum setting. We examine a relative entropy minimization problem for a quantum density operator, potentially of wide relevance in quantum information theory. The resulting solution form motivates us to define the quantum Esscher Transform, which subsumes the classical Esscher Transform as a special case. Envisioning potential applications of the quantum Esscher Transform, we also discuss its implementation on fault-tolerant quantum computers. Our algorithm is based on the modern techniques of block-encoding and quantum singular value transformation (QSVT). We show that given block-encoded inputs, our algorithm outputs a subnormalized block-encoding of the quantum Esscher transform within accuracy $\epsilon$ in $\tilde{O}(\kappa d\log^{2}1/\epsilon)$ queries to the inputs, where $\kappa$ is the condition number of the input density operator and $d$ is the number of constraints.

1 Introduction

In probability and statistics, it is often important to find distributions that have low relative entropy with respect to a given fixed distribution. Further constraints, whose form and interpretation depend on the problem at hand, are frequently imposed on the target distribution.

An interesting example is the following: consider the process of inferring probability distributions from a set of measurement data. These data play the role of the constraints—they put restrictions on what the true distribution could be—and the available data may not suffice to uniquely determine a probability distribution. In this situation, a common approach is to invoke Jaynes’ maximum entropy principle (MaxEnt) [Jay57]. In essence, MaxEnt advocates that the selected distribution be the one that simultaneously maximizes entropy and satisfies the given constraints.

However, the situation becomes more nuanced if we already possess some knowledge of the system, say, a prior distribution. In such cases, a more refined strategy emerges: the minimum relative entropy principle. As expounded in [SJ80, OP07, ZTF13], this principle, regarded as a generalization of MaxEnt, operates by minimizing the distinguishability (characterized by the relative entropy) between the prior distribution and the distribution to be selected, while respecting the imposed constraints. This systematic approach to incorporating new data makes it fundamental in Bayesian statistics. The updating procedure results in the posterior distribution which reflects the most current understanding of the system in light of the observed data.

In the case when the measurement data is presented in the form of expectation values of selected random variables, the solution to the corresponding relative entropy minimization problem takes the form known as an Esscher Transform. Named after Swedish mathematician and economist Fredrik Esscher, who introduced the concept in 1932 in his work on risk theory [Esc32], the Esscher Transform, also known as ‘exponential tilting’ in statistics, and its various extensions have since found many applications beyond minimizing relative entropy. Notable examples include option pricing (in mathematical finance) [GS+93], importance sampling (for rare-event simulation) [Sie76] and Lévy processes (in financial economics) [HS06]. More recently, it has also made inroads into machine learning [BSS23], in the context of empirical risk minimization.

In this paper, we discuss the extension of the above problem to the quantum setting. We consider the following optimization problem:

$$\begin{aligned}
\text{minimize}_{\sigma\geq 0}\quad & S(\sigma\|\rho) \\
\text{s.t.}\quad & \operatorname{Tr}(\sigma H_{i})=m_{i},\quad i\in[d] \\
& \operatorname{Tr}(\sigma)=1,
\end{aligned}\tag{1.1}$$

where $\rho$ is the a priori state and $H_{i}$, $i\in[d]$ are observables. Refer to Problem 2.4 for the precise formulation. In the first part of this work, we derive the formal solution to this constrained optimization problem. The solution methodology is modelled after its classical predecessor, albeit with added technical intricacies to manage. The solution and its proof are given in Theorem 2.5. The form of the solution then motivates us to define the quantum Esscher Transform, see Definition 2.8. The quantum Esscher Transform can be viewed as a generalization of the (classical) Esscher Transform, and indeed subsumes the latter as a special case. In the second part of this work, with an eye toward potential applications, we discuss the implementation of the quantum Esscher Transform on fault-tolerant quantum computers. Our algorithm is based on the modern techniques of block-encoding and the quantum singular value transformation (QSVT) [GSLW19, MRTC21]. As an input model we consider purifications of the density operator $\rho$ and block-encodings of the operators $H_{i}$. The main algorithm is Algorithm 1, whose complexity is discussed in Theorem 4.3. The quantum Esscher transform could find applications in quantum analogues of problems in statistics, machine learning, and finance.

1.1 Preliminaries and notation

We define the following notations. Let $\mathbb{N}=\{1,2,\dots\}$ be the set of positive natural numbers. For $d\in\mathbb{N}$ , $[d]=\{1,2,\dots,d\}$ . Here $\|\cdot\|$ , $\|\cdot\|_{1}$ , $\|\cdot\|_{2}$ and $\|\cdot\|_{T}$ refer to the spectral, $l_{1}$ -, $l_{2}$ - and trace norms respectively. The symbol $\odot$ denotes component-wise product, e.g. for vectors $(v\odot w)_{i}=v_{i}w_{i}$ , for matrices $(A\odot B)_{ij}=A_{ij}B_{ij}$ . Throughout this paper, $\log$ will be base $2$ . For convenience, when calculus is involved we shall differentiate as if it were base $e$ . For a matrix $M$ we write $a\leq M\leq b$ to mean the eigenvalues of $M$ are in $[a,b]$ . Thus, $M\geq 0$ means $M$ is positive semidefinite. We denote a Hilbert space by $\mathcal{H}$ , $\mathcal{H}_{N}$ if its dimension $N$ is to be explicitly specified, the set of linear operators on $\mathcal{H}$ by $\mathcal{L}(\mathcal{H})$ , and the set of density operators on $\mathcal{H}$ by $\mathcal{D}(\mathcal{H})$ . Let $A\in\mathcal{L}(\mathcal{H})$ . The kernel of $A$ is $\ker(A):=\{\ket{\psi}\in\mathcal{H}:A\ket{\psi}=0\}$ and the support of $A$ is $\operatorname{supp}(A):=\ker(A)^{\perp}$ . Note that $\ker(A)\oplus\operatorname{supp}(A)=\mathcal{H}$ . $I_{n}$ denotes the $n$ -qubit identity operator, i.e. it is of size $2^{n}\times 2^{n}$ . We use $\tilde{O}(\cdot)$ to hide polylog factors, i.e., $\tilde{O}(f(n)):=O(f(n)\cdot{\rm polylog}(f(n)))$ . We use $A:=B$ to define expression $A$ in terms of $B$ .

A probability space is denoted by $(\Omega,\Sigma,P)$ , where $\Omega$ is the sample space, $\Sigma$ is the $\sigma$ -algebra over $\Omega$ , and $P$ is the probability measure on $\Sigma$ . While all the discussions in our work are well-defined for general probability spaces, for our purposes we shall restrict our discussion to finite sample spaces, i.e., $|\Omega|<\infty$ , and set $\Sigma=2^{\Omega}$ . In this setting, $P$ can be viewed as a $|\Omega|$ -dimensional vector residing in the hypercube $[0,1]^{|\Omega|}\subseteq\mathbb{R}^{|\Omega|}$ , with components $P(\omega)$ , $\omega\in\Omega$ and normalization $\sum_{\omega\in\Omega}P(\omega)=1$ . Note that technically, a probability measure $P$ is a function on the $\sigma$ -algebra $\Sigma$ , not $\Omega$ . Since we are dealing with a finite sample space here, knowing $P(\{\omega\})$ for all $\omega\in\Omega$ gives us full knowledge of $P$ , from the additivity property of measures. Thus we can and shall simply view $P$ as a function on $\Omega$ and write $P(\omega)$ in place of $P(\{\omega\})$ . Finally, given probability measures $P$ and $Q$ , we say $Q$ is absolutely continuous with respect to $P$ (written $Q\ll P$ ) if $P(\omega)=0\implies Q(\omega)=0$ for all $\omega$ .

2 Quantum Esscher Transform

2.1 Esscher Transform

The Esscher Transform was first defined by F. Esscher in his work on risk theory [Esc32]. Let $f:E\longrightarrow\mathbb{R}$ be a probability mass function, where $E\subset\mathbb{R}^{d}$ and $\theta\in\mathbb{R}^{d}$ . The function $f_{\theta}(x):=\frac{e^{\theta\cdot x}f(x)}{\sum_{x\in E}e^{\theta\cdot x}f(x)}$ is also a probability mass function, and it is called the Esscher Transform of $f$ with parameter $\theta$ . We can replace probability mass functions with probability density functions (accordingly, $\sum\longrightarrow\int$ ).

The Esscher Transform is a map from and onto the space of probability mass/density functions, as $\mathcal{E}(f;\theta)=f_{\theta}$ . In this work, we never invoke $\mathcal{E}$ and simply call $f_{\theta}$ the Esscher Transform of $f$ , in the same spirit as the Fourier Transform. In the context of probability theory, let $(\Omega,\Sigma,P)$ be a probability space and $X:\Omega\longrightarrow\mathbb{R}^{d}$ a random $d$ dimensional vector. This setting motivates the equivalent definition (see Remark 2.3 below) of Esscher Transforms for measures/distributions.

Definition 2.1 (Esscher Transform for probability distributions).

Let $P$ be a probability distribution on a finite sample space $\Omega$, $X:\Omega\longrightarrow\mathbb{R}^{d}$ a random variable and $\theta\in\mathbb{R}^{d}$. The probability distribution

\displaystyle P_{\theta,X}(\omega):=\frac{e^{\theta\cdot X(\omega)}P(\omega)}{\mathbb{E}_{P}[e^{\theta\cdot X}]}

is called the Esscher Transform of $P$ with parameter $\theta$ , with respect to $X$ . For brevity, we say $P_{\theta,X}$ is the $(\theta,X)$ -Esscher Transform of $P$ .
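As an illustration of Definition 2.1, the following minimal NumPy sketch computes the $(\theta,X)$-Esscher Transform on a finite sample space. The function name `esscher_transform` and the fair-die example are assumptions made for illustration only; they do not appear in the original text.

```python
import numpy as np

def esscher_transform(p, X, theta):
    """(theta, X)-Esscher Transform of a distribution p on a finite sample space.

    p     : shape (n,)   probabilities P(omega)
    X     : shape (n, d) or (n,) values X(omega) of the random vector
    theta : shape (d,) or scalar tilting parameter
    """
    p = np.asarray(p, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(p), -1)
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    exponent = X @ theta
    # Subtracting the maximum only rescales numerator and denominator alike.
    weights = np.exp(exponent - exponent.max()) * p
    return weights / weights.sum()

# Illustrative example (assumption): a fair six-sided die, X(omega) = face value.
p = np.full(6, 1.0 / 6.0)
X = np.arange(1, 7)
q = esscher_transform(p, X, 0.5)
print(q)
```

With $\theta>0$ the tilted distribution shifts weight toward larger values of $X$, and $\theta=0$ returns $P$ unchanged.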

This definition is connected to the following problem. Fix $m\in\mathbb{R}^{d}$ . When and how can we derive from $P$ another probability measure $Q$ such that the expectation of $X$ with respect to $Q$ , $\mathbb{E}_{Q}[X]$ is equal to $m$ ? Among such probability measures, if they exist, how can we find the one that is closest (in some sense) to $P$ ? Take as a measure of closeness the relative entropy between $P$ and $Q$ ,

D(Q\|P)=\sum_{\omega\in\Omega}Q(\omega)\log\frac{Q(\omega)}{P(\omega)}.

The definition of $D(Q\|P)$ requires that $Q$ be absolutely continuous with respect to $P$, otherwise $D(Q\|P)=\infty$. Without loss of generality, we can assume $P$ is strictly positive on $\Omega$. If this were not so, then let $S\subset\Omega$ denote the subset on which $P=0$. Since $Q$ is absolutely continuous w.r.t. $P$, we have $D(Q\|P)=\sum_{\omega\in\Omega\setminus S}Q(\omega)\log\frac{Q(\omega)}{P(\omega)}$, so we are reduced to an ‘effective $\Omega$’ on which $P$ is strictly positive. The aforementioned question can then be cast as an optimization problem with multiple constraints:

$$\begin{aligned}
\text{minimize}_{Q\in[0,1]^{|\Omega|}}\quad & D(Q\|P) \\
\text{s.t.}\quad & \mathbb{E}_{Q}[X_{i}]=m_{i},\quad i\in[d] \\
& \sum_{\omega\in\Omega}Q(\omega)=1.
\end{aligned}\tag{2.1}$$

Note that there are $d+1$ constraints on $Q$ , hence in feasible, non-redundant cases we have $d+1\leq|\Omega|$ . We have the following solution to the optimization problem.

Theorem 2.2.

Let $X:\Omega\longrightarrow\mathbb{R}^{d}$ be a random vector and let $m\in\mathbb{R}^{d}$ satisfy $\min_{\omega\in\Omega}X_{i}(\omega)<m_{i}<\max_{\omega\in\Omega}X_{i}(\omega)$ for $i\in[d]$. Then there exists a unique solution $Q^{\star}$ to Problem 2.1, given by

\displaystyle Q^{\star}=\frac{e^{\lambda^{\star}\cdot X}P}{\mathbb{E}_{P}[e^{\lambda^{\star}\cdot X}]},

where $\lambda^{\star}:=\operatorname*{argmin}_{\lambda\in\mathbb{R}^{d}}\mathbb{E}_{P}[e^{\lambda\cdot(X-m)}]$. Thus $Q^{\star}$ is the $(\lambda^{\star},X)$-Esscher Transform of $P$, see Definition 2.1.

The proof is elaborated in Appendix A.
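To make Theorem 2.2 concrete, the sketch below finds $\lambda^{\star}$ numerically by gradient descent on $\log\mathbb{E}_{P}[e^{\lambda\cdot(X-m)}]$, which has the same minimizer as $\mathbb{E}_{P}[e^{\lambda\cdot(X-m)}]$. Gradient descent is a choice made here for simplicity, not a method prescribed by the text. The helper names (`tilt`, `solve_lambda`), the iteration count and the die example with target mean $4.5$ are likewise illustrative assumptions.

```python
import numpy as np

def tilt(p, X, lam):
    """Esscher Transform of p with parameter lam with respect to X (rows = outcomes)."""
    exponent = X @ lam
    w = np.exp(exponent - exponent.max()) * p
    return w / w.sum()

def solve_lambda(p, X, m, iters=2000):
    """Gradient descent on K(lam) = log E_P[exp(lam . (X - m))].

    The gradient of K is E_Q[X] - m with Q = tilt(p, X, lam), and its Hessian is
    the covariance of X under Q, whose largest eigenvalue is at most
    sum_i (range_i / 2)^2; this bound is used as the step-size constant.
    """
    p = np.asarray(p, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(p), -1)
    m = np.atleast_1d(np.asarray(m, dtype=float))
    L = np.sum(((X.max(axis=0) - X.min(axis=0)) / 2.0) ** 2)
    lam = np.zeros(X.shape[1])
    for _ in range(iters):
        q = tilt(p, X, lam)
        lam = lam - (q @ X - m) / L
    return lam

# Illustrative example (assumption): fair die as prior, target mean 4.5.
p = np.full(6, 1.0 / 6.0)
X = np.arange(1, 7, dtype=float).reshape(-1, 1)
m = np.array([4.5])
lam_star = solve_lambda(p, X, m)
q_star = tilt(p, X, lam_star)
print(lam_star, q_star, q_star @ X[:, 0])
```

At convergence the tilted distribution `q_star` satisfies the mean constraint, which is exactly the stationarity condition of the minimization defining $\lambda^{\star}$.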

Remark 2.3.

Let us comment on a subtlety. Above, we have called $Q^{\star}$ the Esscher Transform of $P$ . Recall that the Esscher Transform as originally defined by Esscher pertains to probability mass/density functions instead of measures. Here we show that using the same terminology for probability measures is well-justified (at least for the case when $\Omega$ is discrete). The random variable $X$ induces from the probability measure $P$ the probability mass function $P_{X}(x):=P(X^{-1}(x))$ on $E:=X(\Omega)$ . Assume we have, for probability measures $Q,P$ and random variable $X$ , that

\displaystyle Q(\omega)=\frac{e^{\theta\cdot X(\omega)}P(\omega)}{\mathbb{E}_{P}[e^{\theta\cdot X}]}.

Then for the probability mass functions $Q_{X}$ and $P_{X}$ we have

$$\begin{aligned}
Q_{X}(x)=Q(X^{-1}(x)) &=\sum_{\omega:X(\omega)=x}Q(\omega) \\
&=\frac{\sum_{\omega:X(\omega)=x}e^{\theta\cdot X(\omega)}P(\omega)}{\sum_{\omega\in\Omega}e^{\theta\cdot X(\omega)}P(\omega)} \\
&=\frac{e^{\theta\cdot x}P_{X}(x)}{\sum_{x\in E}\sum_{\omega:X(\omega)=x}e^{\theta\cdot X(\omega)}P(\omega)} \\
&=\frac{e^{\theta\cdot x}P_{X}(x)}{\sum_{x\in E}e^{\theta\cdot x}P_{X}(x)},
\end{aligned}$$

i.e., $Q_{X}$ is the Esscher Transform of $P_{X}$ as defined above.

2.2 Quantum version

2.2.1 Problem statement

Many entities in classical probability theory have meaningful generalizations in quantum theory. For example, sample spaces, probability distributions and random variables find their respective counterparts in Hilbert spaces, density operators and observables (the latter also include the former as special instances). The quantum counterpart of the relative entropy is the quantum relative entropy,

\displaystyle S(\sigma\|\rho):=\operatorname{Tr}\{\sigma(\log\sigma-\log\rho)\},

defined for density operators $\sigma,\rho$ . As in the classical case, the definition of $S(\sigma\|\rho)$ imposes constraints on $\sigma$ and $\rho$ in order to have $S(\sigma\|\rho)<\infty$ . Namely, $\operatorname{supp}(\sigma)\subseteq\operatorname{supp}(\rho)$ (see Chapter 11, [Wil13]) or equivalently, $\ker(\rho)\subseteq\ker(\sigma)$ . Using terminology from measure theory, if this condition is satisfied we say $\sigma$ is absolutely continuous with respect to $\rho$ ( $\sigma\ll\rho$ ). This is analogous to the absolute continuity between probability distributions in classical probability theory. Now we formally state the quantized version of Problem 2.1.

Problem 2.4.

Let $\mathcal{H}_{N}$ be an $N$ -dimensional Hilbert space and $\rho\in\mathcal{D}(\mathcal{H}_{N})$ be a density operator. With $d\in\mathbb{N}$ , for $i\in[d]$ , let $H_{i}$ be an observable with $h_{i,\min}$ and $h_{i,\max}$ denoting its smallest and largest eigenvalue respectively. For $m\in\mathbb{R}^{d}$ with $h_{i,\min}<m_{i}<h_{i,\max}$ , solve

$$\begin{aligned}
\text{minimize}_{\sigma\geq 0}\quad & S(\sigma\|\rho) \\
\text{s.t.}\quad & \operatorname{Tr}(\sigma H_{i})=m_{i},\quad i\in[d] \\
& \operatorname{Tr}(\sigma)=1.
\end{aligned}\tag{2.2}$$

Here $h_{i}$ denotes a generic eigenvalue of $H_{i}$ . Note that because $\sigma,H_{i}$ are Hermitian, $\operatorname{Tr}(\sigma H_{i})$ is real. As before, we require $h_{i,\min}<m_{i}<h_{i,\max}$ , otherwise the constraints $\operatorname{Tr}(\sigma H_{i})=m_{i}$ cannot be satisfied. Finally, we can assume WLOG that $\|H_{i}\|\leq 1$ . This amounts to dividing the constraint $\operatorname{Tr}(\sigma H_{i})=m_{i}$ throughout by $\|H_{i}\|$ if necessary.

2.2.2 Solution

Before delving into the solution, let us briefly comment on a few possible concerns. First, $S(\sigma\|\rho)$ requires taking the logarithm of $\rho$ , which poses a problem if $\rho$ is not strictly positive definite. This issue is circumvented if, as mentioned above, $\ker(\rho)\subseteq\ker(\sigma)$ . The analysis becomes relatively straightforward if we partition the Hilbert space $\mathcal{H}$ into suitable subspaces and examine $\sigma$ over them separately. To this end, we introduce the following notation. Let $\mathcal{G}$ be a subspace of $\mathcal{H}$ . For $A\in\mathcal{L}(\mathcal{H})$ , denote $A_{\mathcal{G}}:=\Pi_{\mathcal{G}}A\Pi_{\mathcal{G}}\in\mathcal{L}(\mathcal{G})$ , where $\Pi_{\mathcal{G}}$ is the projector onto $\mathcal{G}$ .

Second, as in the classical case, we hope to solve this optimization problem using Lagrange multipliers. With a fixed $\rho$ , $S(\sigma\|\rho)$ is a real-valued function of complex matrices. How do we optimize such functions? In principle we could convert everything into real numbers— $M_{N}(\mathbb{C})\cong\mathbb{R}^{2N^{2}}$ , so we could view $S(\sigma\|\rho)$ as a function of $2N^{2}$ real parameters and implement conventional optimization methods. However, this conversion is generally tedious, and the resulting expression for $S(\sigma\|\rho)$ cumbersome. The ‘Wirtinger Calculus’ provides a relatively simple methodology for the optimization of such functions, through the use of ‘Wirtinger derivatives’. We state the main definitions and results of this framework in Appendix B.

We have the following result, which partially resolves Problem 2.4:

Theorem 2.5.

The solution to Problem 2.4 takes the form

\displaystyle\sigma^{\star}=\sigma_{\operatorname{supp}\rho}^{\star}\oplus\sigma_{\ker\rho}^{\star},\tag{2.3}

where

\displaystyle\sigma_{\operatorname{supp}\rho}^{\star}=\frac{e^{\lambda^{\star}\cdot H_{\operatorname{supp}\rho}+\log\rho_{\operatorname{supp}\rho}}}{\operatorname{Tr}(e^{\lambda^{\star}\cdot H_{\operatorname{supp}\rho}+\log\rho_{\operatorname{supp}\rho}})}\qquad\text{and}\qquad\sigma_{\ker\rho}^{\star}=\mathbf{0}.\tag{2.4}

The optimal values $\lambda^{\star}\in\mathbb{R}^{d}$ are to be determined from the constraints

\displaystyle\operatorname{Tr}\left(e^{\lambda^{\star}\cdot(H_{\operatorname{supp}\rho}-m)+\log\rho_{\operatorname{supp}\rho}}(H_{i,\operatorname{supp}\rho}-m_{i})\right)=0,\quad i\in[d].\tag{2.5}

Proof.

To facilitate the presentation of the solution, certain parts of the argument sequence are collated into lemmas and placed below the main body of this proof.

Step 1. First, for any candidate solution $\sigma$ we enforce $\ker\rho\subseteq\ker\sigma$. By Lemma 2.6, this implies $\sigma_{\ker\rho}=\mathbf{0}$ and furthermore enables the decomposition of $\sigma$ into a direct sum: $\sigma=\sigma_{\operatorname{supp}\rho}\oplus\sigma_{\ker\rho}$. With this decomposition, we can consider the trace of the operators over just the subspace $\operatorname{supp}\rho$. Recall that for any $A\in\mathcal{L}(\mathcal{H})$, $\ker A\oplus\operatorname{supp}A=\mathcal{H}$, so $\Pi_{\ker A}+\Pi_{\operatorname{supp}A}=I$. More specifically, $\operatorname{Tr}(\sigma H_{i})=\operatorname{Tr}(\sigma(\Pi_{\operatorname{supp}\rho}+\Pi_{\ker\rho})H_{i}(\Pi_{\operatorname{supp}\rho}+\Pi_{\ker\rho}))=\operatorname{Tr}(\sigma_{\operatorname{supp}\rho}H_{i,\operatorname{supp}\rho})$ and

$$\begin{aligned}
S(\sigma\|\rho) &=\operatorname{Tr}\{(\sigma_{\operatorname{supp}\rho}\oplus\sigma_{\ker\rho})\left(\log(\sigma_{\operatorname{supp}\rho}\oplus\sigma_{\ker\rho})-\log(\rho_{\operatorname{supp}\rho}\oplus\rho_{\ker\rho})\right)\} \\
&=\operatorname{Tr}\{\sigma_{\operatorname{supp}\rho}(\log\sigma_{\operatorname{supp}\rho}-\log\rho_{\operatorname{supp}\rho})\}+\underbrace{\operatorname{Tr}\{\sigma_{\ker\rho}(\log\sigma_{\ker\rho}-\log\rho_{\ker\rho})\}}_{=0} \\
&=S(\sigma_{\operatorname{supp}\rho}\|\rho_{\operatorname{supp}\rho}).
\end{aligned}$$

Thus, we can replace $\mathcal{H}$ in Problem 2.4 by $\operatorname{supp}\rho$ , and the operators by their restrictions to $\operatorname{supp}\rho$ . Note that $\rho_{\operatorname{supp}\rho}$ is positive definite.

Step 2. Next we obtain the form of $\sigma_{\operatorname{supp}\rho}$ . For ease of presentation let us simply denote $(\sigma/\rho/H_{i})_{\operatorname{supp}\rho}$ by $(\sigma/\rho/H_{i})$ . With $\rho$ now positive definite, $\log\rho$ is well-defined. Now we invoke Proposition B.1 to extract the optimal $\sigma$ by setting $\frac{\partial\mathcal{L}}{\partial\sigma}=\mathbf{0}$ .

Set up the Lagrangian

\displaystyle\mathcal{L}=\operatorname{Tr}\{\sigma(\log\sigma-\log\rho)\}-\sum_{i}\lambda_{i}(\operatorname{Tr}(\sigma H_{i})-m_{i})-\eta(\operatorname{Tr}\sigma-1)\tag{2.6}

where $\lambda_{i}$ and $\eta$ are the Lagrange multipliers. Making use of Propositions B.2 and B.3, setting $\frac{\partial\mathcal{L}}{\partial\sigma}$ to zero gives

$$\begin{aligned}
\frac{\partial\mathcal{L}}{\partial\sigma}=\mathbf{0} &\implies(\log\sigma)^{T}+I-(\log\rho)^{T}-(\lambda\cdot H)^{T}-\eta I=\mathbf{0} \\
&\implies\sigma=e^{\eta-1}e^{\lambda\cdot H+\log\rho} \\
&\implies\sigma=\frac{e^{\lambda\cdot H+\log\rho}}{\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})}\qquad\text{after normalization}.
\end{aligned}$$

It remains to determine $\lambda$ from the constraints $\operatorname{Tr}(\sigma H)=m$ . Plugging in the above expression for $\sigma$ into the constraints we have

$$\begin{aligned}
\frac{\operatorname{Tr}(e^{\lambda\cdot H+\log\rho}H)}{\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})}=m &\implies\frac{\operatorname{Tr}(e^{\lambda\cdot H+\log\rho}(H-m))}{\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})}=0 \\
&\implies\operatorname{Tr}(e^{\lambda\cdot(H-m)+\log\rho}(H-m))=0.
\end{aligned}$$

Step 3. Now we show that $\sigma^{\star}$ as given in Eq. 2.4 indeed minimizes $S(\sigma\|\rho)$. This follows from Lemma 2.7. Furthermore, since $S(\sigma\|\rho)$ is a strictly convex functional of $\sigma$, it can have at most one minimizer in the convex set $M$, thereby showing the uniqueness of $\sigma^{\star}$. Finally, again by Lemma 2.7 we note that $\lambda^{\star}$ satisfies $\lambda^{\star}=\operatorname*{argmax}_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})\right]=\operatorname*{argmin}_{\lambda\in\mathbb{R}^{d}}\log\operatorname{Tr}(e^{\lambda\cdot(H-m)+\log\rho})=\operatorname*{argmin}_{\lambda\in\mathbb{R}^{d}}\operatorname{Tr}(e^{\lambda\cdot(H-m)+\log\rho})$, where the last equality holds because $\log f(x)$ and $f(x)$ share the same minimum/maximum points, provided $f(x)>0$ at those points. ∎
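The characterization of $\lambda^{\star}$ in Step 3 suggests a simple numerical procedure, sketched below for a full-rank $\rho$. It runs gradient descent on $\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})-\lambda\cdot m$, whose gradient with respect to $\lambda_{i}$ is $\operatorname{Tr}(\sigma_{\lambda}H_{i})-m_{i}$. The helper names, the step size $1/d$ (chosen assuming $\|H_{i}\|\leq 1$), the iteration count and the single-qubit example are all illustrative assumptions, and the natural logarithm is used throughout.

```python
import numpy as np

def herm_fun(A, f):
    """Apply a scalar function f to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def esscher_state(rho, Hs, lam):
    """Normalized exp(lam . H + log rho) for a full-rank density operator rho."""
    G = herm_fun(rho, np.log) + sum(l * H for l, H in zip(lam, Hs))
    w, V = np.linalg.eigh(G)
    E = (V * np.exp(w - w.max())) @ V.conj().T  # shift cancels after normalization
    return E / np.trace(E).real

def solve_lambda(rho, Hs, m, iters=5000):
    """Gradient descent on log Tr exp(lam . H + log rho) - lam . m."""
    lam = np.zeros(len(Hs))
    step = 1.0 / len(Hs)  # heuristic step size, assuming ||H_i|| <= 1
    for _ in range(iters):
        sigma = esscher_state(rho, Hs, lam)
        grad = np.array([np.trace(sigma @ H).real for H in Hs]) - m
        lam = lam - step * grad
    return lam

# Illustrative single-qubit example (assumption): Pauli Z and X as observables.
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
Xp = np.array([[0.0, 1.0], [1.0, 0.0]])
rho = np.array([[0.7, 0.2], [0.2, 0.3]])
Hs = [Z, Xp]
m = np.array([0.1, 0.5])

lam_star = solve_lambda(rho, Hs, m)
sigma_star = esscher_state(rho, Hs, lam_star)
print(lam_star)
print([np.trace(sigma_star @ H).real for H in Hs])
```

The printed expectation values should match $m$ up to numerical tolerance, which is equivalent to the constraints in Eq. 2.5 being satisfied.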

Lemma 2.6.

Let $\sigma,\rho\in\mathcal{L}(\mathcal{H})$ be normal operators, so that they have spectral decompositions. If $\ker\rho\subseteq\ker\sigma$ , then $\sigma_{\ker\rho}=\mathbf{0}$ and $\sigma$ can be partitioned into a direct sum:

\displaystyle\sigma=\sigma_{\operatorname{supp}\rho}\oplus\sigma_{\ker\rho}.

Proof.

Expand $\sigma$ in terms of the eigenbasis of $\rho$, $\{\ket{i}\}_{i=0}^{N-1}$. Let $S\subseteq\{0,1,\dots,N-1\}$ be the index subset such that $\text{span}\{\ket{i}:i\in S\}=\operatorname{supp}\rho$, so $\text{span}\{\ket{i}:i\in S^{c}\}=\ker\rho$. We have

$$\begin{aligned}
\sigma=\sum_{i,j=0}^{N-1}\bra{i}\sigma\ket{j}\,\ket{i}\bra{j}= &\underbrace{\sum_{i\in S}\sum_{j\in S}\bra{i}\sigma\ket{j}\,\ket{i}\bra{j}}_{=\;\sigma_{\operatorname{supp}\rho}}+\underbrace{\sum_{i\in S}\sum_{j\in S^{c}}\bra{i}\sigma\ket{j}\,\ket{i}\bra{j}}_{=\mathbf{0}} \\
&+\underbrace{\sum_{i\in S^{c}}\sum_{j\in S}\bra{i}\sigma\ket{j}\,\ket{i}\bra{j}}_{=\mathbf{0}}+\underbrace{\sum_{i\in S^{c}}\sum_{j\in S^{c}}\bra{i}\sigma\ket{j}\,\ket{i}\bra{j}}_{=\;\sigma_{\ker\rho}=\;\mathbf{0}},
\end{aligned}$$

where the annihilation of the last three terms comes about because for $i\in S^{c}$ , $\ket{i}\in\ker\rho\subseteq\ker\sigma$ .

Note that the partition of an operator into a direct sum over another operator’s ker and supp subspaces does not hold in general. ∎

The following lemma is the quantized version of Lemma A.1. We employ analogous arguments and notation, starting with

\displaystyle\Lambda=\left\{\frac{e^{\lambda\cdot H+\log\rho}}{\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})}:\lambda\in\mathbb{R}^{d}\right\}\quad\text{and}\quad M=\{\sigma:\operatorname{Tr}(\sigma H)=m\}.

Lemma 2.7.

Let $\rho\in\mathcal{D}(\mathcal{H})$ and $H_{i},i\in[d]$ be observables on $\mathcal{H}$ . Fix $m\in\mathbb{R}^{d}$ . Then for any density operator $\sigma\in\mathcal{D}(\mathcal{H})$ satisfying $\operatorname{Tr}(\sigma H)=m$ , we have

\displaystyle S(\sigma\|\rho)\geq\sup_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})\right].\tag{2.7}

Moreover, the inequality is saturated if $\sigma=\sigma_{\lambda^{\prime}}:=e^{\lambda^{\prime}\cdot H+\log\rho}/\operatorname{Tr}(e^{\lambda^{\prime}\cdot H+\log\rho})\in\Lambda\cap M$ for some $\lambda^{\prime}\in\mathbb{R}^{d}$:

\displaystyle S(\sigma_{\lambda^{\prime}}\|\rho)=\lambda^{\prime}\cdot m-\log\operatorname{Tr}(e^{\lambda^{\prime}\cdot H+\log\rho})=\sup_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})\right].\tag{2.8}

Proof.

Each $\lambda\in\mathbb{R}^{d}$ gives rise to a corresponding $\sigma_{\lambda}\in\Lambda$ (note that $\sigma_{\lambda}$ need not be in $M$ ). Then for any $\sigma$ satisfying $\operatorname{Tr}(\sigma H)=m$ , we have

$$\begin{aligned}
S(\sigma\|\rho) &=S(\sigma\|\sigma_{\lambda})+\operatorname{Tr}\{\sigma(\log\sigma_{\lambda}-\log\rho)\} \\
&\geq\operatorname{Tr}\{\sigma(\log(e^{\lambda\cdot H+\log\rho})-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})-\log\rho)\}\qquad(\text{nonnegativity of }S(\sigma\|\sigma_{\lambda})) \\
&=\operatorname{Tr}\{\sigma(\lambda\cdot H)\}-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho}) \\
&=\lambda\cdot m-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho}).
\end{aligned}\tag{2.9}$$

Since this holds for all $\lambda\in\mathbb{R}^{d}$, we conclude that $S(\sigma\|\rho)\geq\sup_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})\right]$. Furthermore, if $\lambda^{\prime}\in\mathbb{R}^{d}$ is such that $\sigma_{\lambda^{\prime}}\in\Lambda\cap M$, then letting $\sigma=\sigma_{\lambda^{\prime}}$ and rerunning the same argument sequence above gives

$$\begin{aligned}
S(\sigma_{\lambda^{\prime}}\|\rho) &=\operatorname{Tr}\{\sigma_{\lambda^{\prime}}(\log\sigma_{\lambda^{\prime}}-\log\rho)\} \\
&=\operatorname{Tr}\{\sigma_{\lambda^{\prime}}(\log(e^{\lambda^{\prime}\cdot H+\log\rho})-\log\operatorname{Tr}(e^{\lambda^{\prime}\cdot H+\log\rho})-\log\rho)\} \\
&=\operatorname{Tr}\{\sigma_{\lambda^{\prime}}(\lambda^{\prime}\cdot H)\}-\log\operatorname{Tr}(e^{\lambda^{\prime}\cdot H+\log\rho}) \\
&=\lambda^{\prime}\cdot m-\log\operatorname{Tr}(e^{\lambda^{\prime}\cdot H+\log\rho}).
\end{aligned}$$

In particular, this also shows that $\lambda^{\prime}=\operatorname*{argmax}_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\operatorname{Tr}(e^{\lambda\cdot H+\log\rho})\right]$. ∎
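The bound of Lemma 2.7 can be checked numerically on a small instance. In the sketch below, a state $\sigma_{\lambda^{\prime}}$ is placed in $\Lambda\cap M$ by construction: $m$ is defined as its vector of expectation values. A second feasible state is obtained by adding a small multiple of Pauli $Y$, which leaves the $Z$ and $X$ expectation values unchanged. The helper names and all numerical values are illustrative assumptions, and the natural logarithm is used.

```python
import numpy as np

def herm_fun(A, f):
    """Apply a scalar function f to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def rel_entropy(sigma, rho):
    """S(sigma || rho) = Tr{sigma (log sigma - log rho)} for full-rank inputs."""
    diff = herm_fun(sigma, np.log) - herm_fun(rho, np.log)
    return np.trace(sigma @ diff).real

def exponent(rho, Hs, lam):
    return herm_fun(rho, np.log) + sum(l * H for l, H in zip(lam, Hs))

def esscher_state(rho, Hs, lam):
    E = herm_fun(exponent(rho, Hs, lam), np.exp)
    return E / np.trace(E).real

def dual(rho, Hs, lam, m):
    """lam . m - log Tr exp(lam . H + log rho), the bracket in Eq. (2.7)."""
    w = np.linalg.eigvalsh(exponent(rho, Hs, lam))
    return float(np.dot(lam, m)) - np.log(np.sum(np.exp(w)))

Z = np.array([[1.0, 0.0], [0.0, -1.0]])
Xp = np.array([[0.0, 1.0], [1.0, 0.0]])
Y = np.array([[0.0, -1.0j], [1.0j, 0.0]])
rho = np.array([[0.7, 0.2], [0.2, 0.3]])
Hs = [Z, Xp]

lam_prime = np.array([0.4, -0.3])
sigma_prime = esscher_state(rho, Hs, lam_prime)
m = np.array([np.trace(sigma_prime @ H).real for H in Hs])  # sigma_prime is in M by construction

sigma_other = sigma_prime + 0.05 * Y  # same Z and X expectations, still positive definite
primal_opt = rel_entropy(sigma_prime, rho)
primal_other = rel_entropy(sigma_other, rho)
dual_opt = dual(rho, Hs, lam_prime, m)

rng = np.random.default_rng(0)
dual_samples = [dual(rho, Hs, lam, m) for lam in rng.normal(size=(200, 2))]
print(primal_opt, dual_opt, primal_other, max(dual_samples))
```

In this instance the primal value at $\sigma_{\lambda^{\prime}}$ coincides with the dual value at $\lambda^{\prime}$, the other feasible state has strictly larger relative entropy, and no sampled $\lambda$ exceeds the dual value at $\lambda^{\prime}$.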

Motivated by the form of the state $\sigma_{\operatorname{supp}\rho}^{\star}$ in Theorem 2.5, we make the following definition:

Definition 2.8 (Quantum Esscher Transform).

Let $\rho\in\mathcal{D}(\mathcal{H})$ be a density operator with $\rho>0$, let $H_{i},\;i\in[d]$ be observables and let $\theta\in\mathbb{R}^{d}$. The density operator

\displaystyle\rho_{\theta,H}:=\frac{e^{\theta\cdot H+\log\rho}}{\operatorname{Tr}(e^{\theta\cdot H+\log\rho})}

is called the $(\theta,H)$ -quantum Esscher transform of $\rho$ .

Remark 2.9.

The state $\sigma_{\operatorname{supp}\rho}^{\star}$ in Theorem 2.5 is thus a $(\lambda^{\star},H_{\operatorname{supp}\rho})$ -quantum Esscher transform of $\rho_{\operatorname{supp}\rho}>0$ . Also note that the quantum Esscher transform subsumes the classical Esscher transform as a special case, wherein $\rho,H_{i}$ are diagonal and thus commute.
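As a classical-simulation sketch of Definition 2.8 and Remark 2.9, the snippet below evaluates $\rho_{\theta,H}$ with dense linear algebra and checks that diagonal inputs reproduce the classical Esscher Transform. The function name `quantum_esscher` and the example matrices are illustrative assumptions. This is not the block-encoding algorithm discussed later in the paper.

```python
import numpy as np

def herm_fun(A, f):
    """Apply a scalar function f to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def quantum_esscher(rho, Hs, theta):
    """(theta, H)-quantum Esscher transform of a full-rank density operator rho."""
    G = herm_fun(rho, np.log) + sum(t * H for t, H in zip(theta, Hs))
    E = herm_fun(G, np.exp)
    return E / np.trace(E).real

# Non-commuting example (assumption): single qubit with Pauli Z and X.
Z = np.array([[1.0, 0.0], [0.0, -1.0]])
Xp = np.array([[0.0, 1.0], [1.0, 0.0]])
rho = np.array([[0.7, 0.2], [0.2, 0.3]])
rho_tilted = quantum_esscher(rho, [Z, Xp], [0.8, -0.5])

# Commuting (diagonal) example: reduces to the classical Esscher Transform.
p = np.array([0.5, 0.3, 0.2])
x = np.array([-1.0, 0.0, 1.0])
theta = 0.7
rho_diag = quantum_esscher(np.diag(p), [np.diag(x)], [theta])
q_classical = np.exp(theta * x) * p / np.sum(np.exp(theta * x) * p)
print(np.diag(rho_diag), q_classical)
```

Since the matrix logarithm is only defined here for strictly positive $\rho$, a rank-deficient input would first have to be restricted to its support, as in Theorem 2.5.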

2.2.3 Connection to quantum imaginary time evolution

Quantum imaginary-time evolution (QITE) is a conceptual tool which relates to the finding of ground states of Hamiltonians [MJE+19, MST+20]. From the real-time Schrödinger equation one obtains the imaginary-time Schrödinger equation $\frac{\partial|\psi\rangle}{\partial\tau}=-H|\psi\rangle$ by performing a Wick rotation, i.e. $\tau=it$. For general mixed states $\rho$, the imaginary-time Liouville-von Neumann equation [BK91] is given by

\displaystyle\frac{\partial\rho}{\partial\tau}=-\{H,\rho\}+2\langle H\rangle\rho,\tag{2.10}

from which the solution is derived as

\displaystyle\rho(\tau)=A(\tau)e^{-\tau H}\rho(0)e^{-\tau H},\tag{2.11}

where $A(\tau)=1/\operatorname{Tr}(e^{-2\tau H}\rho(0))$ is the normalisation factor.
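As a quick sanity check that Eq. (2.11) solves Eq. (2.10), the sketch below compares a central finite difference of $\rho(\tau)$ with the right-hand side of Eq. (2.10), taking $\langle H\rangle=\operatorname{Tr}(\rho(\tau)H)$. The matrices, the value of $\tau$ and the finite-difference step are illustrative assumptions.

```python
import numpy as np

def herm_fun(A, f):
    """Apply a scalar function f to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def rho_tau(rho0, H, tau):
    """Eq. (2.11): normalized exp(-tau H) rho(0) exp(-tau H)."""
    E = herm_fun(H, lambda w: np.exp(-tau * w))
    N = E @ rho0 @ E
    return N / np.trace(N).real  # Tr(N) = Tr(exp(-2 tau H) rho(0)) by cyclicity

def rhs(rho, H):
    """Right-hand side of Eq. (2.10): -{H, rho} + 2 <H> rho."""
    mean_H = np.trace(rho @ H).real
    return -(H @ rho + rho @ H) + 2.0 * mean_H * rho

rho0 = np.array([[0.7, 0.2], [0.2, 0.3]])
H = np.array([[0.5, 0.3], [0.3, -0.2]])
tau, h = 0.4, 1e-5

lhs = (rho_tau(rho0, H, tau + h) - rho_tau(rho0, H, tau - h)) / (2.0 * h)
residual = np.linalg.norm(lhs - rhs(rho_tau(rho0, H, tau), H))
print(residual)
```

The residual is limited only by the finite-difference error, so it should be small compared with the entries of $\rho(\tau)$.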

In [OP07] it was asserted that under certain conditions, namely ‘when the prior and posterior states are close to each other with respect to the Fisher information metric’, the relative entropy minimization problem could be solved by formally integrating a ‘quantum trajectory’ equation [OP07, Bra96]. This equation takes on the same form as Eq. 2.10, and thus its solution is given by Eq. 2.11. More specifically, we have

\displaystyle\rho(\theta)=\frac{e^{\theta\cdot H/2}\rho e^{\theta\cdot H/2}}{\operatorname{Tr}(e^{\theta\cdot H}\rho)},

where $\theta$ are the Lagrange multipliers. Here we simply observe that $\rho(\theta)$ resembles the imaginary-time-evolved state in Eq. (2.11) if $\theta$ is one-dimensional and after making the substitution $\tau=-\theta/2$ . Since the quantum Esscher transform provides an exact solution to the problem, under the aforementioned condition we note the connection between the quantum Esscher transform and QITE.
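To illustrate how the two expressions relate, the following sketch compares $\rho(\theta)$ with the quantum Esscher transform $e^{\theta H+\log\rho}/\operatorname{Tr}(e^{\theta H+\log\rho})$ for a single observable. When $\rho$ and $H$ commute the two states coincide; in the non-commuting example chosen here they differ. The helper names and numerical values are illustrative assumptions, and the sketch does not probe the Fisher-information closeness condition of [OP07].

```python
import numpy as np

def herm_fun(A, f):
    """Apply a scalar function f to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def quantum_esscher(rho, H, theta):
    """exp(theta H + log rho) / Tr(...) for full-rank rho and a single observable H."""
    E = herm_fun(herm_fun(rho, np.log) + theta * H, np.exp)
    return E / np.trace(E).real

def sandwich(rho, H, theta):
    """exp(theta H / 2) rho exp(theta H / 2) / Tr(exp(theta H) rho)."""
    E = herm_fun(H, lambda w: np.exp(theta * w / 2.0))
    N = E @ rho @ E
    return N / np.trace(N).real

Z = np.array([[1.0, 0.0], [0.0, -1.0]])
theta = 1.0

rho_comm = np.diag([0.7, 0.3])                    # commutes with Z
rho_noncomm = np.array([[0.7, 0.2], [0.2, 0.3]])  # does not commute with Z

gap_comm = np.linalg.norm(quantum_esscher(rho_comm, Z, theta) - sandwich(rho_comm, Z, theta))
gap_noncomm = np.linalg.norm(quantum_esscher(rho_noncomm, Z, theta) - sandwich(rho_noncomm, Z, theta))
print(gap_comm, gap_noncomm)
```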

Next, we discuss how to implement the quantum Esscher Transform on quantum computers using modern techniques based on block-encodings (BE) and the quantum singular value transformation (QSVT). Before doing so we collate the relevant tools and techniques of the framework in the next section.

3 Overview on block-encodings and quantum singular value transformations

The technique of quantum signal processing [LYC16] and its lifting, via ‘qubitization’, to quantum singular value transformation (QSVT) [LC19, GSLW19] provide a concise way to formulate quantum algorithms, particularly for linear algebraic tasks. This framework has provided more efficient implementations of several existing quantum algorithms, such as Hamiltonian simulation [LC17, LC19], amplitude amplification and estimation [GSLW19, RF23] and quantum linear systems solving [GSLW19], and even led to the discovery of new algorithms. For our purposes, we do not actually need the full generality of QSVT. As our matrices of interest are Hermitian and thus admit spectral decompositions, a relaxed version of QSVT, the quantum eigenvalue transformation (QET), suffices. We direct readers interested in learning more about QSVT to [GSLW19, MRTC21, DMB+23].

Definition 3.1 (Block-Encoding).

Let $A$ be an $n$ -qubit matrix, $\alpha,\varepsilon\in\mathbb{R}_{+}$ and $a\in\mathbb{N}$ . We say that the $(n+a)$ -qubit unitary $U$ is an $(\alpha,a,\varepsilon)$ -block-encoding of $A$ if

\|A-\alpha(\bra{0^{a}}\otimes I_{n})U(\ket{0^{a}}\otimes I_{n})\|\leq\varepsilon.

Remark 3.2.

Note that if $U$ is an $(\alpha,a,\varepsilon)$ -BE of $A$ , then equivalently it is a $(1,a,\frac{\varepsilon}{\alpha})$ -BE of $\frac{A}{\alpha}$ . Also, if we have a $(\alpha,a,\varepsilon)$ -BE of $A$ then we also have a $(\alpha,a+a^{\prime},\varepsilon+\varepsilon^{\prime})$ -BE of $A$ , where $1\leq a^{\prime}\in\mathbb{N}$ and $\varepsilon^{\prime}>0$ . Making the increment $a^{\prime}$ simply corresponds to tacking on an extra $a^{\prime}$ -qubit identity operator $I_{a^{\prime}}$ . More specifically, if $U$ is an $(\alpha,a,\varepsilon)$ -BE of $A$ then $I_{a^{\prime}}\otimes U$ is an $(\alpha,a+a^{\prime},\varepsilon)$ -BE of $A$ , since

\displaystyle\|A-\alpha(\bra{0^{a}}\otimes I_{n})U(\ket{0^{a}}\otimes I_{n})\|\leq\varepsilon\implies\|A-\alpha(\bra{0^{a^{\prime}+a}}\otimes I_{n})(I_{a^{\prime}}\otimes U)(\ket{0^{a^{\prime}+a}}\otimes I_{n})\|\leq\varepsilon.

Finally, if $\varepsilon$ is already an error bound, $\varepsilon+\varepsilon^{\prime}$ clearly serves as another error bound, albeit a weaker one.
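As a small numerical illustration of Definition 3.1 and Remark 3.2 (our own sketch, not a construction taken from [GSLW19]), the code below builds a one-ancilla block-encoding of a Hermitian matrix through the standard unitary dilation and then checks the padding claim. The helper names `dilation_block_encoding` and `top_left_block` are hypothetical, and the dilation assumes $A$ is Hermitian with $\|A\|\leq\alpha$.

```python
import numpy as np

def dilation_block_encoding(A, alpha):
    """Return a unitary whose top-left block is A/alpha (A Hermitian, ||A|| <= alpha)."""
    w, V = np.linalg.eigh(A / alpha)
    s = np.sqrt(np.clip(1.0 - w**2, 0.0, None))   # spectrum of sqrt(I - (A/alpha)^2)
    B = (V * w) @ V.conj().T                      # A / alpha
    S = (V * s) @ V.conj().T                      # commutes with B, and B^2 + S^2 = I
    return np.block([[B, S], [S, -B]])

def top_left_block(U, n_dim):
    """(<0^a| ⊗ I) U (|0^a> ⊗ I), with the ancillas as the leading tensor factors."""
    return U[:n_dim, :n_dim]

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
A = (M + M.conj().T) / 2                          # random 2-qubit Hermitian matrix
alpha = np.linalg.norm(A, 2) + 0.5                # any alpha >= ||A|| is admissible

U = dilation_block_encoding(A, alpha)
err_unitary = np.linalg.norm(U.conj().T @ U - np.eye(8), 2)
err_block = np.linalg.norm(A - alpha * top_left_block(U, 4), 2)

# Remark 3.2: one extra ancilla (a' = 1), i.e. I_{a'} ⊗ U, encodes the same block.
U_padded = np.kron(np.eye(2), U)
err_padded = np.linalg.norm(A - alpha * top_left_block(U_padded, 4), 2)
print(err_unitary, err_block, err_padded)
```

All three printed errors should be at the level of floating-point round-off, so in the language of Definition 3.1 both `U` and `U_padded` behave as numerically exact block-encodings of `A`.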

[GSLW19] provides a construction of exact block-encodings for density operators, assuming access to oracles which prepare the purifications of the density operators:

Definition 3.3 (Purified quantum query-access).

Let $\rho$ be an $n$ -qubit density operator. We say $\rho$ has purified quantum query-access if we have access to a $(n_{\rho}+n)$ -qubit unitary operator $O_{\rho}$ , where

\displaystyle O_{\rho}\ket{0^{n_{\rho}}}\ket{0^{n}}=\ket{\rho}

prepares $\ket{\rho}$ , the purification of $\rho$ (i.e. $\text{tr}_{n_{\rho}}\ket{\rho}\bra{\rho}=\rho$ ) with the help of $n_{\rho}$ ancilla qubits. (Footnote 2: Theoretically, any $n$ -qubit quantum state can be purified with at most $n$ ancilla qubits, so one can assume $n_{\rho}\leq n$ . In practice however, it could be more convenient to use more than $n$ ancillas for purification. Thus we make the more relaxed assumption that $n_{\rho}=\operatorname{poly}(n)$ .)

Proposition 3.4 (Block-encoding of density operators – Lemma 45, [GSLW19]).

Let $\rho$ be an $n$ -qubit density operator with purified quantum query-access via $O_{\rho}$ . Then $\widetilde{O_{\rho}}:=(O_{\rho}^{\dagger}\otimes I_{n})(I_{n_{\rho}}\otimes\text{SWAP}_{n})(O_{\rho}\otimes I_{n})$ , where $\text{SWAP}_{n}$ exchanges the two $n$ -qubit registers, is a $(1,n+n_{\rho},0)$ -BE of $\rho$ .
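The statement can be checked numerically on a small instance. The sketch below is our own illustration under stated assumptions: it takes $n_{\rho}=n=2$, builds the purification from the eigendecomposition of a random $\rho$, and completes it to a unitary $O_{\rho}$ with a QR factorization (the helper `unitary_with_first_column` is hypothetical). The SWAP acts on the two $n$ -qubit registers, and the ancilla register is the leading tensor factor.

```python
import numpy as np

def unitary_with_first_column(v):
    """Some unitary whose first column is the unit vector v (obtained via QR)."""
    M = np.eye(len(v), dtype=complex)
    M[:, 0] = v
    Q, R = np.linalg.qr(M)
    Q[:, 0] *= R[0, 0]                    # QR fixes the column only up to a phase
    return Q

rng = np.random.default_rng(5)
N = 4                                     # system dimension 2^n with n = 2
G = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
rho = G @ G.conj().T
rho /= np.trace(rho).real

# Purification |rho> = sum_k sqrt(p_k) |k>_anc |psi_k>_sys, using n_rho = n ancillas.
p, V = np.linalg.eigh(rho)
p = np.clip(p, 0.0, None)
purification = (np.sqrt(p)[:, None] * V.T).reshape(-1)
O_rho = unitary_with_first_column(purification)      # O_rho |0>|0> = |rho>

# SWAP on two N-dimensional registers: |i>|j> -> |j>|i>.
SWAP = np.zeros((N * N, N * N))
for i in range(N):
    for j in range(N):
        SWAP[j * N + i, i * N + j] = 1.0

I_N = np.eye(N)
O_tilde = np.kron(O_rho.conj().T, I_N) @ np.kron(I_N, SWAP) @ np.kron(O_rho, I_N)
block = O_tilde[:N, :N]                   # (<0|<0| ⊗ I) O_tilde (|0>|0> ⊗ I)
print(np.linalg.norm(block - rho, 2))
```

The printed deviation should be at round-off level, in line with the zero error parameter of the block-encoding.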

For general matrices which need not be density operators, [CGJ18, GSLW19] also showed how to implement their block-encodings efficiently, assuming the existence of quantum random access memory (QRAM) [GLM08]. Given block-encodings of operators $A_{i}$ , we can construct block-encodings of their linear combinations and products. For linear combinations, we make use of an auxiliary tool known as a ‘state preparation pair’. Recall that $\|\cdot\|_{1}$ is the $l_{1}$ /Manhattan norm.

Definition 3.5 (State Preparation Pair).

Let $y\in\mathbb{C}^{m}$ and $\|y\|_{1}\leq\beta$ . The pair of unitaries $(P_{L},P_{R})$ is called a ( $\beta,b,\varepsilon_{\text{SP}}$ )-state-preparation-pair for $y$ if

\displaystyle P_{L}\ket{0^{b}}=\sum_{j=0}^{2^{b}-1}c_{j}\ket{j},\quad P_{R}\ket{0^{b}}=\sum_{j=0}^{2^{b}-1}d_{j}\ket{j}

such that $\sum_{j=0}^{m-1}|y_{j}-\beta c_{j}^{*}d_{j}|\leq\varepsilon_{\text{SP}}$ and $c_{j}^{*}d_{j}=0$ for $j=m,\dots,2^{b}-1$ .

One can think of a state preparation pair as encoding the desired state/vector $y$ in the first $m$ elements of a length- $2^{b}$ column vector whose elements are $c_{j}^{*}d_{j}$ , up to an error of $\varepsilon_{\text{SP}}$ . The role of $\beta$ is to take care of normalization.
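For a real vector $y$ , one simple way to satisfy Definition 3.5 with $\beta=\|y\|_{1}$ and $\varepsilon_{\text{SP}}=0$ is to split each entry into two square-root amplitudes and put the sign on one side. The sketch below (our own illustration; `state_preparation_amplitudes` is a hypothetical helper) only constructs the amplitude vectors $c$ and $d$ ; any unitaries $P_{L},P_{R}$ with these vectors as their first columns would then form the pair.

```python
import numpy as np

def state_preparation_amplitudes(y, b):
    """Amplitudes (c, d) of P_L|0^b> and P_R|0^b> for a real vector y, beta = ||y||_1."""
    y = np.asarray(y, dtype=float)
    m = len(y)
    assert m <= 2**b, "need 2^b >= m basis states"
    beta = np.abs(y).sum()
    c = np.zeros(2**b)
    d = np.zeros(2**b)
    c[:m] = np.sqrt(np.abs(y) / beta)                 # magnitudes only
    d[:m] = np.sign(y) * np.sqrt(np.abs(y) / beta)    # signs carried by d
    return beta, c, d

y = np.array([0.5, -1.5, 2.0])
beta, c, d = state_preparation_amplitudes(y, b=2)

# Defining condition: sum_j |y_j - beta * conj(c_j) * d_j| <= eps_SP.
eps_sp = np.abs(y - beta * np.conj(c[:3]) * d[:3]).sum()
print(beta, np.linalg.norm(c), np.linalg.norm(d), eps_sp)
```

Both amplitude vectors are normalized by construction, and the padded entries ( $j\geq m$ ) are zero, so the condition $c_{j}^{*}d_{j}=0$ for $j=m,\dots,2^{b}-1$ holds automatically.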

Proposition 3.6 (Linear combination of block-encoded matrices – Lemma 52, [GSLW19]).

Let

i.

$A_{j},\;j=0,\dots,m-1$ be $n$ -qubit operators with respective ( $\alpha,a,\varepsilon_{\text{BE}}$ )-BEs $U_{j}$ ,
ii.

$A=\sum_{j=0}^{m-1}y_{j}A_{j}$ for $y:=(y_{0},\dots,y_{m-1})\in\mathbb{C}^{m}$ ,
iii.

$(P_{L},P_{R})$ be a $(\beta,b,\varepsilon_{\text{SP}})$ -state-preparation-pair for $y$ .

Then there exists a $(\alpha\beta,a+b,\alpha\varepsilon_{\text{SP}}+\beta\varepsilon_{\text{BE}})$ -BE of $A$ , given by

\widetilde{W}=(P_{L}^{\dagger}\otimes I_{a}\otimes I_{n})W(P_{R}\otimes I_{a}\otimes I_{n}),

where

W=\sum_{j=0}^{m-1}\ket{j}\bra{j}\otimes U_{j}+\sum_{j=m}^{2^{b}-1}\ket{j}\bra{j}\otimes I_{a}\otimes I_{n}

is a $(n+a+b)$ -qubit unitary.

In Proposition 3.6, the subnormalization factors of the $A_{j}$ ’s are to be the same. Later on, we will need a slight generalization of the above result whereby this requirement is dropped.

Proposition 3.7 (Generalized linear combination of block-encoded matrices).

Let

i.

$A_{j},\;j=0,\dots,m-1$ be $n$ -qubit operators with respective ( $\alpha_{j},a,\varepsilon_{\text{BE}}$ )-BEs $U_{j}$ for $\alpha:=(\alpha_{0},\dots,\alpha_{m-1})\in\mathbb{R}_{+}^{m}$ ,
ii.

$A=\sum_{j=0}^{m-1}y_{j}A_{j}$ for $y:=(y_{0},\dots,y_{m-1})\in\mathbb{C}^{m}$ ,
iii.

$(P_{L},P_{R})$ be a $(\beta,b,\varepsilon_{\text{SP}})$ -state-preparation-pair for $\alpha\odot y$ .

Then there exists a $(\beta,a+b,\frac{\beta}{\inf_{j}\alpha_{j}}\varepsilon_{\text{BE}}+\varepsilon_{\text{SP}})$ -BE of $A$ , given by

\widetilde{W}=(P_{L}^{\dagger}\otimes I_{a}\otimes I_{n})W(P_{R}\otimes I_{a}\otimes I_{n}),

where

W=\sum_{j=0}^{m-1}\ket{j}\bra{j}\otimes U_{j}+\sum_{j=m}^{2^{b}-1}\ket{j}\bra{j}\otimes I_{a}\otimes I_{n}

is a $(n+a+b)$ -qubit unitary.

Proof.

The following is adapted from the proof of Lemma 52, [GSLW19]. By definition of state-preparation pairs (see Definition 3.5), $P_{L}\ket{0^{b}}=\sum_{j=0}^{2^{b}-1}c_{j}\ket{j}$ and $P_{R}\ket{0^{b}}=\sum_{j=0}^{2^{b}-1}d_{j}\ket{j}$ such that $\sum_{j=0}^{m-1}|\alpha_{j}y_{j}-\beta c_{j}^{*}d_{j}|\leq\varepsilon_{\text{% SP}}$ . First we evaluate the block extraction of $\widetilde{W}$ . We have

\begin{align*}
&(\bra{0^{b+a}}\otimes I_{n})\widetilde{W}(\ket{0^{b+a}}\otimes I_{n})\\
&=(\bra{0^{b+a}}\otimes I_{n})(P_{L}^{\dagger}\otimes I_{a}\otimes I_{n})\left(\sum_{j=0}^{m-1}\ket{j}\bra{j}\otimes U_{j}+\sum_{j=m}^{2^{b}-1}\ket{j}\bra{j}\otimes I_{a}\otimes I_{n}\right)(P_{R}\otimes I_{a}\otimes I_{n})(\ket{0^{b+a}}\otimes I_{n})\\
&=\sum_{j=0}^{m-1}\bra{0^{b}}P_{L}^{\dagger}\ket{j}\bra{j}P_{R}\ket{0^{b}}\cdot(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\\
&=\sum_{j=0}^{m-1}c_{j}^{*}d_{j}\cdot(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n}).
\end{align*}

In going from the first equality to the second, we have made use of the fact that for state preparation pairs $c_{j}^{*}d_{j}=0$ for $j=m,\dots,2^{b}-1$ . The second summand in $W$ is thus annihilated. Therefore,

\begin{align*}
\left\|A-\beta(\bra{0^{b+a}}\otimes I_{n})\widetilde{W}(\ket{0^{b+a}}\otimes I_{n})\right\|
&=\left\|A-\sum_{j=0}^{m-1}(\beta c_{j}^{*}d_{j}-\alpha_{j}y_{j}+\alpha_{j}y_{j})\cdot(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\sum_{j=0}^{m-1}|\beta c_{j}^{*}d_{j}-\alpha_{j}y_{j}|+\left\|A-\sum_{j=0}^{m-1}\alpha_{j}y_{j}(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\varepsilon_{\text{SP}}+\left\|\sum_{j=0}^{m-1}y_{j}A_{j}-\sum_{j=0}^{m-1}y_{j}\alpha_{j}(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\varepsilon_{\text{SP}}+\sum_{j=0}^{m-1}|y_{j}|\left\|A_{j}-\alpha_{j}(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\varepsilon_{\text{SP}}+\sum_{j=0}^{m-1}|y_{j}|\varepsilon_{\text{BE}}\\
&\leq\varepsilon_{\text{SP}}+\frac{\beta}{\inf_{j}\alpha_{j}}\varepsilon_{\text{BE}},
\end{align*}

where the last inequality was obtained using $\beta\geq\sum_{j=0}^{m-1}|\alpha_{j}y_{j}|\geq\sum_{j=0}^{m-1}(\inf_{k}\alpha_% {k})|y_{j}|$ . ∎
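The construction of $\widetilde{W}$ in Proposition 3.7 can be reproduced numerically for small matrices. The following sketch is our own illustration under simplifying assumptions: real coefficients $y$ , exact one-ancilla block-encodings obtained by unitary dilation (so $\varepsilon_{\text{BE}}=0$ ), and an exact state-preparation pair for $\alpha\odot y$ with $\beta=\|\alpha\odot y\|_{1}$ (so $\varepsilon_{\text{SP}}=0$ ). The helper names `dilation` and `unitary_with_first_column` are hypothetical.

```python
import numpy as np

def dilation(A, alpha):
    """One-ancilla (alpha, 1, 0)-block-encoding of a Hermitian A with ||A|| <= alpha."""
    w, V = np.linalg.eigh(A / alpha)
    s = np.sqrt(np.clip(1.0 - w**2, 0.0, None))
    B, S = (V * w) @ V.conj().T, (V * s) @ V.conj().T
    return np.block([[B, S], [S, -B]])

def unitary_with_first_column(v):
    """Some unitary whose first column is the unit vector v (obtained via QR)."""
    M = np.eye(len(v), dtype=complex)
    M[:, 0] = v
    Q, R = np.linalg.qr(M)
    Q[:, 0] *= R[0, 0]                            # fix the phase so that Q[:, 0] == v
    return Q

rng = np.random.default_rng(1)
n_dim, m, b = 4, 3, 2
alphas = np.array([1.0, 2.0, 5.0])                # distinct subnormalization factors
y = np.array([0.7, -0.4, 1.0])                    # real coefficients

A_list = []
for a_j in alphas:
    M = rng.normal(size=(n_dim, n_dim))
    H = (M + M.T) / 2
    A_list.append(0.9 * a_j * H / np.linalg.norm(H, 2))   # ensures ||A_j|| <= alpha_j
U_list = [dilation(A_j, a_j) for A_j, a_j in zip(A_list, alphas)]

# Exact state-preparation pair for alpha ⊙ y with beta = ||alpha ⊙ y||_1.
z = alphas * y
beta = np.abs(z).sum()
c = np.zeros(2**b)
d = np.zeros(2**b)
c[:m] = np.sqrt(np.abs(z) / beta)
d[:m] = np.sign(z) * np.sqrt(np.abs(z) / beta)
P_L, P_R = unitary_with_first_column(c), unitary_with_first_column(d)

# W = sum_{j<m} |j><j| ⊗ U_j + sum_{j>=m} |j><j| ⊗ I
dim_U = 2 * n_dim
W = np.zeros((2**b * dim_U, 2**b * dim_U), dtype=complex)
for j in range(2**b):
    blk = U_list[j] if j < m else np.eye(dim_U)
    W[j * dim_U:(j + 1) * dim_U, j * dim_U:(j + 1) * dim_U] = blk

W_tilde = np.kron(P_L.conj().T, np.eye(dim_U)) @ W @ np.kron(P_R, np.eye(dim_U))
A = sum(y_j * A_j for y_j, A_j in zip(y, A_list))
lcu_error = np.linalg.norm(A - beta * W_tilde[:n_dim, :n_dim], 2)
print(beta, lcu_error)
```

With exact inputs the error bound of Proposition 3.7 evaluates to zero, so `lcu_error` should only reflect floating-point round-off.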

Remark 3.8.

In the special case where the block-encodings of the $A_{j}$ ’s have the same subnormalization factors, i.e., $\alpha_{j}=\alpha$ for all $j$ , we recover Proposition 3.6 from Proposition 3.7 . To see this, observe that if $(P_{L},P_{R})$ is a $(\beta,b,\varepsilon_{\text{SP}})$ -state-preparation-pair for $\alpha\odot y$ , then $\sum_{j}|\alpha_{j}y_{j}-\beta c_{j}^{*}d_{j}|\leq\varepsilon_{\text{SP}}% \implies\sum_{j}|\alpha y_{j}-\beta c_{j}^{*}d_{j}|\leq\varepsilon_{\text{SP}}% \implies\sum_{j}|y_{j}-\frac{\beta}{\alpha}c_{j}^{*}d_{j}|\leq\frac{% \varepsilon_{\text{SP}}}{\alpha}$ , thus implying $(P_{L},P_{R})$ is a $(\frac{\beta}{\alpha},b,\frac{\varepsilon_{\text{SP}}}{\alpha})$ -state-preparation-pair for $y$ . According to Proposition 3.6, $\widetilde{W}$ is then a $(\alpha\cdot\frac{\beta}{\alpha},\;a+b,\;\alpha\cdot\frac{\varepsilon_{\text{% SP}}}{\alpha}+\frac{\beta}{\alpha}\varepsilon_{\text{BE}})$ -BE of $A$ . This is in agreement with Proposition 3.7.

We now arrive at a milestone within the QSVT framework. Namely, the ability to implement block-encodings of polynomials of a matrix from a given block-encoding of the matrix. In many applications however, the functions of interest are not polynomials. In such cases, one has to first approximate the desired function by a polynomial in order to apply QSVT/QET.

Theorem 3.9 (Polynomial Eigenvalue Transformation – Theorem 56, [GSLW19]).

Let $U$ be an $(\alpha,a,\varepsilon)$ -encoding of a Hermitian matrix $A$ (equivalently, a $(1,a,\varepsilon/\alpha)$ -encoding of $A/\alpha$ ) and $P\in\mathbb{R}[x]$ be a degree- $d$ polynomial satisfying $|P(x)|\leq\frac{1}{2}$ on $[-1,1]$ . Then, one can construct a quantum circuit $\tilde{U}$ which is a $(1,a+2,4d\sqrt{\varepsilon/\alpha})$ -encoding of $P(A/\alpha)$ . The circuit $\tilde{U}$ consists of $d$ applications of $U$ and $U^{\dagger}$ , one controlled- $U$ , and $\mathcal{O}((a+1)d)$ other one- and two-qubit gates.

Proposition 3.10 (Bounded Polynomial Approximation – Corollary 66, [GSLW19]).

Let $x_{0}\in[-1,1]$ , $r\in(0,2]$ , $\delta\in(0,r]$ and let $f:[x_{0}-r-\delta,x_{0}+r+\delta]\longrightarrow\mathbb{C}$ be such that $f(x)=\sum_{l=0}^{\infty}a_{l}(x-x_{0})^{l}$ for all $x\in[x_{0}-r-\delta,x_{0}+r+\delta]$ . Suppose $B>0$ is such that $\sum_{l=0}^{\infty}(r+\delta)^{l}|a_{l}|\leq B$ . Let $\varepsilon\in(0,\frac{1}{2B}]$ , then there is an efficiently computable polynomial $P\in\mathbb{C}[x]$ of degree $\mathcal{O}\left(\frac{1}{\delta}\log\left(\frac{B}{\varepsilon}\right)\right)$ such that

\begin{align}
\|f(x)-P(x)\|_{[x_{0}-r,x_{0}+r]}&\leq\varepsilon, \tag{3.1}\\
\|P(x)\|_{[-1,1]}&\leq\varepsilon+\|f(x)\|_{[x_{0}-r-\delta/2,x_{0}+r+\delta/2]}\leq\varepsilon+B, \tag{3.2}\\
\|P(x)\|_{[-1,1]\setminus[x_{0}-r-\delta/2,x_{0}+r+\delta/2]}&\leq\varepsilon. \tag{3.3}
\end{align}

If we choose $B$ sufficiently large such that $\frac{1}{2B}<1$ , then we also have an $\varepsilon$ -independent bound on $P(x)$ : $\|P(x)\|_{[-1,1]}\leq 1+B$ .

Theorem 3.9 and Proposition 3.10 are to be used in conjunction to produce block-encodings of general functions of Hermitian matrices. In doing so, we first note that Theorem 3.9 produces an encoding of $P(A/\alpha)$ , not $P(A)$ . Thus, with a polynomial approximation of $f$ , say $P(x)\approx f(x)$ , it is generally not true that $P(A/\alpha)\approx f(A)$ . What we need is a polynomial approximation not of $f$ , but of a (horizontally) scaled version of $f$ , $f^{\prime}(x):=f(\alpha x)$ , so that $P(x)\approx f^{\prime}(x)\implies P(A/\alpha)\approx f^{\prime}(A/\alpha)=f(A)$ . Second, we also have to take into account the polynomial approximation error incurred in producing the final desired block encoding $f(A)$ . We take care of these matters in Corollary 3.11, which, given the block-encoding of an arbitrary Hermitian matrix $A$ , produces a block-encoding of $f(A)$ , where $f$ is a generic real-valued function.
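To make the need for the horizontal rescaling concrete, consider the following toy instance (our own illustration, with the arbitrary choices $f(x)=x^{2}$ and $\alpha=2$ , for which the polynomial approximation is exact):

```latex
f(x) = x^{2},\quad \alpha = 2:\qquad
P(x) := f(x) = x^{2} \;\Longrightarrow\; P(A/\alpha) = \tfrac{1}{4}A^{2} \neq f(A),
\qquad
P(x) := f^{\prime}(x) = f(\alpha x) = 4x^{2} \;\Longrightarrow\; P(A/\alpha) = A^{2} = f(A).
```

In this instance $4x^{2}$ exceeds the bound $\frac{1}{2}$ that Theorem 3.9 requires on $[-1,1]$ , which is why the polynomial is additionally divided by $2(1+B)$ in the proof of Corollary 3.11.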

Corollary 3.11 (Block-encoding functions of general Hermitian matrices).

Given

i.

A Hermitian matrix $\lambda_{\min}\leq A\leq\lambda_{\max}$ , $-\infty<\lambda_{\min}<\lambda_{\max}<\infty$ and $U$ , an $(\alpha,a,\varepsilon)$ -encoding of $A$ .
ii.

$f:I\longrightarrow\mathbb{R}$ , a smooth function on an open interval $I$ containing $[\lambda_{\min},\lambda_{\max}]$ . Assume the function $x\mapsto f(\alpha x)$ satisfies the conditions in Proposition 3.10 with $[\lambda_{\min}/\alpha,\lambda_{\max}/\alpha]\subseteq[x_{0}-r,x_{0}+r]$ and series-of-coefficients bound $B$ .
iii.

Polynomial approximation error tolerance for $f$ : $\varepsilon_{\text{poly}}\in(0,\frac{1}{2}]$ .

Then there exists a quantum circuit $U_{f}$ which is a $\left(2(1+B),\;a+2,\;\varepsilon_{\text{poly}}+2(1+B)(4d\sqrt{\varepsilon/% \alpha})\right)$ -encoding of $f(A)$ . The construction of $U_{f}$ makes $d=\mathcal{O}\left(\frac{1}{\delta}\log\frac{B}{\varepsilon_{\text{poly}}}\right)$ queries to $U$ .

Proof.

First, $\alpha\geq\|A\|=\max\{|\lambda_{\min}|,|\lambda_{\max}|\}$ . Define the scaling map $t_{\alpha}:x\mapsto x/\alpha$ , so that under this map $[\lambda_{\min},\lambda_{\max}]\mapsto[\lambda_{\min}/\alpha,\lambda_{\max}/\alpha]$ . By assumption on $f$ there exists $x_{0}\in[-1,1]$ , $r\in(0,2]$ , $\delta\in(0,r]$ such that (i.) $[\lambda_{\min}/\alpha,\lambda_{\max}/\alpha]\subseteq[x_{0}-r,x_{0}+r]$ , (ii.) $f\circ t_{\alpha}^{-1}(x)=\sum_{l=0}^{\infty}a_{l}(x-x_{0})^{l}$ on $[x_{0}-r-\delta,x_{0}+r+\delta]$ and (iii.) $\sum_{l=0}^{\infty}(r+\delta)^{l}|a_{l}|\leq B$ for some $B>0$ .

By Proposition 3.10, given polynomial approximation error tolerance $\varepsilon_{\text{poly}}$ there exists a polynomial $Q\in\mathbb{C}[x]$ of degree $\mathcal{O}\left(\frac{1}{\delta}\log\left(\frac{B}{\varepsilon_{\text{poly}}}\right)\right)$ which $\varepsilon_{\text{poly}}$ -approximates $f\circ t_{\alpha}^{-1}$ on $[x_{0}-r,x_{0}+r]$ and is bounded in magnitude by $1+B$ on $[-1,1]$ . Since the spectrum of $A/\alpha$ lies in $[\lambda_{\min}/\alpha,\lambda_{\max}/\alpha]\subseteq[x_{0}-r,x_{0}+r]$ , we have

\displaystyle\left\|f\circ t_{\alpha}^{-1}\left(\frac{A}{\alpha}\right)-Q\left(\frac{A}{\alpha}\right)\right\|\leq\|f\circ t_{\alpha}^{-1}(x)-Q(x)\|_{[x_{0}-r,x_{0}+r]}\leq\varepsilon_{\text{poly}}.

In order to apply Theorem 3.9, our polynomial has to be real and upper-bounded by $1/2$ on $[-1,1]$ . Observe that for any complex-valued function $F$ and domain $S$ ,

\displaystyle\|F\|_{S}=\sup_{x\in S}|F(x)|=\sup_{x\in S}\sqrt{(\operatorname{Re}F(x))^{2}+(\operatorname{Im}F(x))^{2}}\geq\sup_{x\in S}|\operatorname{Re}F(x)|=\|\operatorname{Re}F\|_{S}.

Since $f$ itself is real-valued, $\operatorname{Re}Q\in\mathbb{R}[x]$ is qualified to assume the role of $P$ in Proposition 3.10. That is, the real polynomial $\operatorname{Re}Q$ also $\varepsilon_{\text{poly}}$ -approximates $f\circ t_{\alpha}^{-1}$ on $[x_{0}-r,x_{0}+r]$ and is bounded above by $1+B$ on $[-1,1]$ . Thus, letting $P\leftarrow\frac{\operatorname{Re}Q}{2(1+B)}$ in Theorem 3.9 we obtain $\tilde{U}$ , a $(1,a+2,4d\sqrt{\varepsilon/\alpha})$ -encoding of $\frac{\operatorname{Re}Q}{2(1+B)}(A/\alpha)$ , where $d=\mathcal{O}\left(\frac{1}{\delta}\log\left(\frac{B}{\varepsilon_{\text{poly}% }}\right)\right)$ . Putting these together and noting that $f\circ t_{\alpha}^{-1}(\frac{A}{\alpha})=f(A)$ , we have

\begin{align*}
&\left\|\frac{f(A)}{2(1+B)}-(\bra{0^{a+2}}\otimes I)\tilde{U}(\ket{0^{a+2}}\otimes I)\right\|\\
&\qquad\leq\left\|\frac{f\circ t_{\alpha}^{-1}(\frac{A}{\alpha})}{2(1+B)}-\frac{\operatorname{Re}Q(\frac{A}{\alpha})}{2(1+B)}\right\|+\left\|\frac{\operatorname{Re}Q(\frac{A}{\alpha})}{2(1+B)}-(\bra{0^{a+2}}\otimes I)\tilde{U}(\ket{0^{a+2}}\otimes I)\right\|\\
&\qquad\leq\frac{\varepsilon_{\text{poly}}}{2(1+B)}+4d\sqrt{\varepsilon/\alpha}.
\end{align*}

Thus, choosing $U_{f}=\tilde{U}$ gives us a $\left(2(1+B),\;a+2,\;\varepsilon_{\text{poly}}+2(1+B)(4d\sqrt{\varepsilon/% \alpha})\right)$ -encoding of $f(A)$ . ∎

4 Implementation on quantum computers

In this section, we provide a quantum algorithm implementing the quantum Esscher Transform, based on block-encodings and QSVT. We assume the inputs come in the form of block-encodings. Our algorithm outputs the Esscher-transformed state in block-encoded form; we also discuss how this output can subsequently be translated into the physical state itself.

Reference [GSLW19] demonstrates how to construct block-encodings for density operators $\rho$ within the purified quantum query-access model (see Definition 3.3 and Proposition 3.4 above). For the Hermitian operators $H_{i}$ which are generally not density operators, their block-encodings can be constructed efficiently for many physical Hamiltonians, or if the $H_{i}$ ’s are stored in sparse data structures or KP trees. Along the way we shall also need as an auxiliary tool ‘state-preparation pairs’ (see Definition 3.5), to prepare linear combinations of the Hamiltonians. We assume immediate access to these, as we do for block-encodings. For the construction of state-preparation pairs, one can refer to [vAG18].

4.1 Technical lemmas

The logarithm of the density matrix $\rho$ is a key ingredient of the quantum Esscher transform. Here we provide a technical lemma on constructing a block-encoding of the logarithm of a density matrix from the block-encoding of that matrix.

Lemma 4.1 (Block-encoding of $\log\rho$ ).

Let $U_{\rho}$ be a $(1,a,0)$ -BE of an $n$ -qubit density operator $\frac{1}{\kappa}\leq\rho\leq 1$ , where $\kappa>1$ , and let $\varepsilon_{\text{poly}}>0$ be a polynomial approximation error tolerance. Then one can construct a $\left(2(1+\log 2\kappa),\;a+2,\;\varepsilon_{\text{poly}}\right)$ -BE of $\log\rho$ using $\mathcal{O}\left(\kappa\log\left(\frac{\log\kappa}{\varepsilon_{\text{poly}}}\right)\right)$ queries to $U_{\rho}$ .

Proof.

First we construct a polynomial approximation of $\log x$ . More specifically, we check that the function $\log x$ satisfies the conditions of Proposition 3.10, with the appropriate $x_{0},r,\delta$ and $B$ . Corollary 3.11 then gives us the desired block-encoding.

The following derivation is based on the proof of Corollary 67, [GSLW19] and Lemma 11, [GL19]. Negative power functions $x^{-c}$ share with $\log x$ the common property of going to infinity as $x$ approaches $0$ , thus the Taylor expansions of these functions are performed about $x=1$ . Choose $x_{0}=1$ , $r=1-\frac{1}{\kappa}$ and $\delta=\frac{1}{2\kappa}$ . The Taylor series of $\log x$ about $x=1$ is $\log x=\sum_{k=1}^{\infty}\frac{(-1)^{k+1}}{k}(x-1)^{k}$ . With $a_{k}=\frac{(-1)^{k+1}}{k}$ , the series-of-coefficients bound $B$ in Proposition 3.10 is

\displaystyle\sum_{k=1}^{\infty}(r+\delta)^{k}|a_{k}|=\sum_{k=1}^{\infty}\frac{(1-1/2\kappa)^{k}}{k}=\sum_{k=1}^{\infty}\frac{(-1)^{k}}{k}\left(\frac{1}{2\kappa}-1\right)^{k}=-\log\frac{1}{2\kappa}=\log 2\kappa=:B.

Corollary 3.11 gives us the unitary $U_{\log\rho}$ , which is a $\left(2(1+\log 2\kappa),\;a+2,\;\varepsilon_{\text{poly}}\right)$ -encoding of $\log\rho$ , which can be constructed using $\mathcal{O}\left(\kappa\log\left(\frac{\log\kappa}{\varepsilon_{\text{poly}}}% \right)\right)$ queries to $U_{\rho}$ . ∎
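The two ingredients of this proof can be probed numerically. The sketch below is our own illustration: it sums the series defining $B$ and compares it with $\log 2\kappa$ , and it uses Chebyshev interpolation of $\log x$ on $[\frac{1}{\kappa},1]$ as a stand-in for the approximating polynomial. The interpolant is not the polynomial constructed in Proposition 3.10; it only indicates how the approximation error on the relevant interval decays with the degree. The value $\kappa=10$ and the degrees are arbitrary choices.

```python
import numpy as np
from numpy.polynomial import Chebyshev

kappa = 10.0
r, delta = 1.0 - 1.0 / kappa, 1.0 / (2.0 * kappa)

# Series-of-coefficients bound: sum_k (r + delta)^k / k should equal log(2 kappa).
k = np.arange(1, 5001)
B_partial = np.sum((r + delta) ** k / k)
B_exact = np.log(2.0 * kappa)

# Chebyshev interpolant of log on [1/kappa, 1] as a stand-in approximating polynomial.
xs = np.linspace(1.0 / kappa, 1.0, 2001)
errors = {}
for deg in (10, 20, 40):
    p = Chebyshev.interpolate(np.log, deg, domain=[1.0 / kappa, 1.0])
    errors[deg] = np.max(np.abs(p(xs) - np.log(xs)))   # sup-norm error on a grid
print(B_partial, B_exact, errors)
```

The error is expected to shrink geometrically with the degree, at a rate that worsens as $\kappa$ grows because the singularity of $\log x$ at $x=0$ moves closer to the interval; this is the qualitative origin of the factor $\kappa$ in the query count.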

Next, we provide a lemma to construct the block-encoding of an exponentiated matrix from the block-encoding of that matrix.

Lemma 4.2 (Block-encoding of $e^{H}$ ).

Given $U_{H}$ , a $(\alpha,a,\varepsilon)$ -BE of $H$ and polynomial approximation error tolerance $\varepsilon_{\text{poly}}>0$ , there is a $\left(4,\;a+2,\;\varepsilon_{\text{poly}}+16t\sqrt{\varepsilon/\alpha}\right)$ -BE of $e^{H}/e^{\alpha}$ , constructible using $t$ queries to $U_{H}$ . Here

\displaystyle t=\mathcal{O}\left(\sqrt{\max(\alpha,\log\frac{1}{\varepsilon_{\text{poly}}})\log\frac{1}{\varepsilon_{\text{poly}}}}\right).

Proof.

By Corollary 64, [GSLW19], there exists $P\in\mathbb{R}[x]$ of degree $t=\mathcal{O}\left(\sqrt{\max(\alpha,\log\frac{1}{\varepsilon_{\text{poly}}})% \log\frac{1}{\varepsilon_{\text{poly}}}}\right)$ such that $\|\frac{e^{\alpha x}}{e^{\alpha}}-P(x)\|_{[-1,1]}\leq\varepsilon_{\text{poly}}$ . Furthermore $\|P(x)\|\leq\|\frac{e^{\alpha x}}{e^{\alpha}}-P(x)\|_{[-1,1]}+\|\frac{e^{% \alpha x}}{e^{\alpha}}\|_{[-1,1]}\leq 1+B$ , where $B=1$ . Applying Corollary 3.11 with $f(x)=\frac{e^{x}}{e^{\alpha}}$ gives a $\left(4,\;a+2,\;\varepsilon_{\text{poly}}+16t\sqrt{\varepsilon/\alpha}\right)$ -encoding of $e^{H}/e^{\alpha}$ , making $t$ queries to $U_{H}$ . ∎
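A classical emulation can illustrate what Lemma 4.2 encodes. In the sketch below (our own illustration, not the polynomial of Corollary 64 in [GSLW19]), a Chebyshev interpolant of the scaled function $x\mapsto e^{\alpha x}/e^{\alpha}$ on $[-1,1]$ is applied to $H/\alpha$ and compared with $e^{H}/e^{\alpha}$ . The matrix $H$ , the factor $1.5$ in $\alpha$ , and the degrees are arbitrary assumptions.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def apply_poly(p, X):
    """Evaluate a polynomial p at a Hermitian matrix X through its eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * p(w)) @ V.conj().T

rng = np.random.default_rng(2)
M = rng.normal(size=(6, 6))
H = (M + M.T) / 2
alpha = 1.5 * np.linalg.norm(H, 2)               # subnormalization with ||H|| <= alpha

def target(x):
    """Horizontally scaled and normalized exponential, bounded by 1 on [-1, 1]."""
    return np.exp(alpha * x) / np.exp(alpha)

w, V = np.linalg.eigh(H)
exact = (V * np.exp(w - alpha)) @ V.conj().T     # e^H / e^alpha

errors = {}
for deg in (4, 8, 16, 32):
    p = Chebyshev.interpolate(target, deg, domain=[-1.0, 1.0])
    errors[deg] = np.linalg.norm(apply_poly(p, H / alpha) - exact, 2)
print(alpha, np.linalg.norm(exact, 2), errors)
```

Since $\|H\|\leq\alpha$ , the target $e^{H}/e^{\alpha}$ has operator norm at most $1$ , which is what makes it a candidate for block-encoding in the first place.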

4.2 Algorithm

We now provide the algorithm implementing the quantum Esscher transform, see Algorithm 1. We specify the constraints on the inputs and the guarantees on the output in the algorithm itself. A step-by-step analysis of Algorithm 1 is provided below in detail, after which the overall (query) complexity is stated. We summarize this information in Theorem 4.3.

Theorem 4.3.

Let us be given the block-encodings of $\rho$ and $H_{j}$ , $j\in[d]$ , parameters $\theta\in\mathbb{R}^{d}$ and error tolerance $\varepsilon$ as specified in Algorithm 1. Then Algorithm 1 outputs an $\varepsilon$ -approximate block-encoding of the (subnormalized) quantum Esscher transform $\sigma=\frac{e^{\sum_{i}\theta_{i}H_{i}+\log\rho}}{\mathcal{N}},$ making

\widetilde{\mathcal{O}}\left(\kappa\log^{2}\left(\frac{1}{\varepsilon}\right)\right)

queries to $U_{\rho}$ and

\mathcal{O}\left(\log\frac{1}{\varepsilon}\right)

queries to each $U_{j}$ .

Algorithm 1 Quantum Esscher Transform via QSVT – QEsscher($\rho,H,\theta$)

1: Input:
2: - Unitary $O_{\rho}$ preparing the purification of the $n$-qubit density operator $\frac{1}{\kappa}\leq\rho\leq 1$ using $n_{\rho}$ ancillary qubits
3: - Quantum circuits $U_{j}$ which are $(1,a,\varepsilon_{\text{BE}})$-BEs of $H_{j}$ for $j\in[d]$, where $\varepsilon_{\text{BE}}=\left(\frac{\varepsilon}{8\log\frac{1}{\varepsilon}}\right)^{2}$
4: - Parameters $\theta\in\mathbb{R}^{d}$
5: - Output block-encoding error $0<\varepsilon<2^{-\|\theta\|_{1}-2(1+\log 2\kappa)}$
6: Output: A $(1,\;\max\{a,n+n_{\rho}\}+\lceil\log d\rceil+4,\;\varepsilon)$-BE of $\sigma=\frac{e^{\sum_{i}\theta_{i}H_{i}+\log\rho}}{\mathcal{N}}$, where $\mathcal{N}=e^{\|\theta\|_{1}+2(1+\log 2\kappa)}$ is a subnormalization factor.
7: Use $O_{\rho}$ to construct $U_{\rho}$, a $(1,n+n_{\rho},0)$-BE of $\rho$
8: Construct $U_{\log\rho}$, a $(2(1+\log 2\kappa),\;n+n_{\rho}+2,\;\varepsilon_{\text{BE}})$-BE of $\log\rho$. This makes $t=\mathcal{O}\left(\kappa\log\left(\frac{\log\kappa}{\varepsilon_{\text{BE}}}\right)\right)$ queries to $U_{\rho}$, see Lemma 4.1.
9: Construct the $(\beta,b,\varepsilon_{\text{SP}})$-state-preparation-pair $(P_{L},P_{R})$ for $\alpha\odot\theta$, where
10:     $\beta\leftarrow\|\theta\|_{1}+2(1+\log 2\kappa)$
11:     $b\leftarrow\lceil\log d\rceil$
12:     $\varepsilon_{\text{SP}}\leftarrow\beta\varepsilon_{\text{BE}}$
13: Using $(P_{L},P_{R})$, combine $U_{\log\rho}$ and $U_{j}$, $j\in[d]$, to give $U_{H}$, a $(\beta,\;\max\{a,n+n_{\rho}\}+2+\lceil\log d\rceil,\;2\beta\varepsilon_{\text{BE}})$-BE of $H:=\sum_{i}\theta_{i}H_{i}+\log\rho$. This makes 1 query to $(P_{L},P_{R})$ and 1 query to $U_{\log\rho}$ and each $U_{j}$, see Proposition 3.7.
14: Construct $U_{\sigma}$, a $(1,\;\max\{a,n+n_{\rho}\}+4+\lceil\log d\rceil,\;\varepsilon)$-BE of $\sigma:=e^{H}/\mathcal{N}$. This makes $t=\mathcal{O}\left(\log\frac{1}{\varepsilon}\right)$ queries to $U_{H}$, see Lemma 4.2.
15: return $U_{\sigma}$

Proof of Theorem 4.3.

Now we analyze the steps of Algorithm 1 in more detail to give the query complexity of QEsscher( $\rho,H,\theta$ ).

Step 1. From Proposition 3.4 we construct $U_{\rho}=\widetilde{O_{\rho}}:=(O_{\rho}^{\dagger}\otimes I_{n})(I_{n_{\rho}}\otimes\text{SWAP}_{n})(O_{\rho}\otimes I_{n})$ , a $(1,n+n_{\rho},0)$ -BE of $\rho$ . This makes $\mathcal{O}(1)$ queries to $O_{\rho}$ .

Step 2. This step entails a polynomial approximation to the logarithm function on the interval $[\frac{1}{\kappa},1]$ . Denote by $\varepsilon_{\text{poly}}$ the approximation error tolerance. Choose $\varepsilon_{\text{poly}}\leq\varepsilon_{\text{BE}}$ . Lemma 4.1 gives $U_{\log\rho}$ , a $(2(1+\log 2\kappa),\;n+n_{\rho}+2,\;\varepsilon_{\text{BE}})$ -BE of $\log\rho$ . The construction of $U_{\log\rho}$ makes $t=\mathcal{O}\left(\kappa\log\left(\frac{\log\kappa}{\varepsilon_{\text{BE}}}% \right)\right)$ queries to $U_{\rho}$ , where $t$ is the degree of the approximating polynomial (see Proposition 3.10/Corollary 3.11).

Step 3. Construct a $(\beta,b,\varepsilon_{\text{SP}})$ -state-preparation-pair $(P_{L},P_{R})$ for $\alpha\odot\theta\in\mathbb{R}^{d+1}$ , where $\alpha=(1^{d},2(1+\log 2\kappa))$ and $\theta=(\theta_{1},\dots,\theta_{d},1)$ (see Proposition 3.7). Choose $\beta=\|\alpha\odot\theta\|_{1}=\|\theta\|_{1}+2(1+\log 2\kappa)$ . $b$ has to be such that $d+1\leq 2^{b}$ , so choose $b=\lceil\log d\rceil$ . Finally, choose $\varepsilon_{\text{SP}}\leq\beta\varepsilon_{\text{BE}}$ . The construction of $(P_{L},P_{R})$ can be achieved using $\mathcal{O}(d)$ elementary gates [BCC ${}^{+}$ 15].

Step 4. Now we make use of our access to the state-preparation-pair $(P_{L},P_{R})$ . To form linear combinations of block-encodings, the number of ancilla qubits required for each constituent block-encoding should be the same, see Proposition 3.6/3.7. Remark 3.2 shows that we can always equalize this number of ancilla qubits by padding with additional ancillas. The equalized number of ancillas is $\max\{a,n+n_{\rho}+2\}\leq\max\{a,n+n_{\rho}\}+2$ . We could also take $a+n+n_{\rho}+2$ , but we want to minimize the number of ancilla qubits. From Proposition 3.7 we get $U_{H}$ , a $(\beta,\;\max\{a,n+n_{\rho}\}+2+\lceil\log d\rceil,\;2\beta\varepsilon_{\text{% BE}})$ -BE of $H:=\sum_{i}\theta_{i}H_{i}+\log\rho$ , making 1 query to $(P_{L},P_{R})$ and 1 query to $U_{\log\rho}$ and each $U_{j}$ .

Step 5. Finally, we construct a block-encoding for $e^{H}/\mathcal{N}$ . At this stage, we have a $(\beta,\;\max\{a,n+n_{\rho}\}+2+\lceil\log d\rceil,\;2\beta\varepsilon_{\text{% BE}})$ -BE of $H$ . Lemma 4.2 gives a $(1,\;\max\{a,n+n_{\rho}\}+\lceil\log d\rceil+4,\;\varepsilon_{\text{poly}}/4+4% t\sqrt{2\varepsilon_{\text{BE}}})$ -BE of $\sigma=e^{H}/{4e^{\beta}}$ (thus $\mathcal{N}=4e^{\beta}$ ), where $t=\mathcal{O}\left(\sqrt{\max(\beta,\log\frac{1}{\varepsilon_{\text{poly}}})% \log\frac{1}{\varepsilon_{\text{poly}}}}\right)$ . It remains to make judicious choices for $\varepsilon_{\text{poly}}$ (note that the $\varepsilon_{\text{poly}}$ at this step need not be the same as the one in Step 2) and $\varepsilon_{\text{BE}}$ in order to ensure the overall block-encoding error is less than $\varepsilon$ , i.e.

\displaystyle\frac{\varepsilon_{\text{poly}}}{4}+4t\sqrt{2\varepsilon_{\text{BE}}}\leq\varepsilon. \tag{4.1}

Now given a sufficiently small $\varepsilon$ such that $\varepsilon\leq 2^{-\beta}$ , choose $\varepsilon_{\text{poly}}=\min\{\varepsilon,2^{-\beta}\}=\varepsilon$ and

\varepsilon_{\text{BE}}=\left(\frac{\varepsilon}{8\log\frac{1}{\varepsilon}}\right)^{2}.

These choices ensure Equation 4.1 is satisfied. Note that $\lim_{x\rightarrow 0}\frac{x}{\log\frac{1}{x}}=0$ , so $\varepsilon_{\text{BE}}\rightarrow 0$ as $\varepsilon\rightarrow 0$ . The degree of the approximating polynomial, and thus the number of queries to $U_{H}$ required, is $t=\mathcal{O}\left(\sqrt{\max(\beta,\log\frac{1}{\varepsilon_{\text{poly}}})% \log\frac{1}{\varepsilon_{\text{poly}}}}\right)=\mathcal{O}\left(\log\frac{1}{% \varepsilon}\right)$ . Recall that constructing $U_{H}$ itself makes 1 query to $U_{\log\rho}$ and each $U_{j}$ . Lastly, observe that $\|e^{H}\|\leq e^{\|H\|}\leq e^{\sum_{i}|\theta_{i}|+\log\kappa}\leq e^{\beta}<% \mathcal{N}$ , so $\mathcal{N}$ is a valid subnormalization factor.
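As a sanity check of the error budget, one can substitute the above choices into the left-hand side of Equation 4.1. The following is our own arithmetic, and it makes explicit that the inequality depends on the constant hidden in $t=\mathcal{O}\left(\log\frac{1}{\varepsilon}\right)$ :

```latex
\frac{\varepsilon_{\text{poly}}}{4} + 4t\sqrt{2\varepsilon_{\text{BE}}}
= \frac{\varepsilon}{4} + 4\sqrt{2}\,t\cdot\frac{\varepsilon}{8\log\frac{1}{\varepsilon}}
= \varepsilon\left(\frac{1}{4} + \frac{t}{\sqrt{2}\,\log\frac{1}{\varepsilon}}\right)
\leq \varepsilon
\quad\Longleftrightarrow\quad
t \leq \frac{3\sqrt{2}}{4}\log\frac{1}{\varepsilon}.
```

If the degree $t$ carries a larger constant $c$ in front of $\log\frac{1}{\varepsilon}$ , shrinking $\varepsilon_{\text{BE}}$ by a factor of order $c^{2}$ should restore the inequality; since $\varepsilon_{\text{BE}}$ enters the query counts only through a logarithm, this would not be expected to change the asymptotic complexity.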

Overall complexity: $U_{\sigma}$ makes $\mathcal{O}(\log\frac{1}{\varepsilon})$ queries to $U_{H}$ . $U_{H}$ queries $U_{\log\rho}$ and each $U_{j}$ exactly once, and $U_{\log\rho}$ in turn makes $\mathcal{O}\left(\kappa\log\left(\frac{\log\kappa}{\varepsilon_{\text{BE}}}% \right)\right)$ queries to $U_{\rho}$ . Accordingly, the implementation of $U_{\sigma}$ makes

\mathcal{O}\left(\log\frac{1}{\varepsilon}\right)\cdot\mathcal{O}\left(\kappa\log\left(\log\kappa\cdot\frac{1}{\varepsilon^{2}}\cdot\log^{2}\frac{1}{\varepsilon}\right)\right)\subseteq\mathcal{O}\left(\kappa\log\left(\frac{\log\kappa}{\varepsilon}\right)\log\left(\frac{1}{\varepsilon}\right)\right)\subseteq\widetilde{\mathcal{O}}\left(\kappa\log^{2}\left(\frac{1}{\varepsilon}\right)\right)

queries to $U_{\rho}$ and $\mathcal{O}\left(\log\frac{1}{\varepsilon}\right)$ queries to each $U_{j}$ , thus

\mathcal{O}\left(d\log\frac{1}{\varepsilon}\right)

queries to $\{U_{j}\}_{j=1}^{d}$ , the constraint operators collectively considered. ∎
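For small instances, the operator that Algorithm 1 block-encodes can be computed classically, which gives a reference against which a circuit implementation could be compared. The sketch below is our own illustration and does not emulate the quantum circuit: it evaluates $H=\sum_{i}\theta_{i}H_{i}+\log\rho$ , the quantity $\beta$ from Step 3, and the subnormalized output, using $\mathcal{N}=4e^{\beta}$ as in Step 5 of the proof. The random inputs, the values of $\theta$ , and the helper `herm_fn` are hypothetical.

```python
import numpy as np

def herm_fn(X, fn):
    """Apply a scalar function to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * fn(w)) @ V.conj().T

rng = np.random.default_rng(3)
N, d = 4, 2

# Full-rank density operator with 1/kappa <= rho <= 1.
M = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
rho = M @ M.conj().T + 0.5 * np.eye(N)
rho /= np.trace(rho).real
kappa = 1.0 / np.linalg.eigvalsh(rho).min()

# Constraint operators with ||H_j|| <= 1, so that subnormalization 1 is admissible.
H_list = []
for _ in range(d):
    G = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
    G = (G + G.conj().T) / 2
    H_list.append(G / np.linalg.norm(G, 2))
theta = np.array([0.8, -0.3])

H = sum(t * H_j for t, H_j in zip(theta, H_list)) + herm_fn(rho, np.log)
beta = np.abs(theta).sum() + 2.0 * (1.0 + np.log(2.0 * kappa))
calN = 4.0 * np.exp(beta)                        # subnormalization as in Step 5
exp_H = herm_fn(H, np.exp)
sigma = exp_H / calN                             # subnormalized Esscher transform
sigma_normalized = exp_H / np.trace(exp_H).real  # unit-trace Esscher transform
print(kappa, beta, np.linalg.norm(sigma, 2), np.trace(sigma_normalized).real)
```

Because $\|H\|\leq\|\theta\|_{1}+\log\kappa<\beta$ , the operator norm of `sigma` stays below $\frac{1}{4}$ , consistent with $\mathcal{N}$ being a valid subnormalization factor.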

4.3 Further discussion

If the positive definite $\rho\in\mathbb{C}^{N\times N}$ is full rank, the condition number is $\kappa\geq N$ since the eigenvalue lower bound $\frac{1}{\kappa}$ must be $\leq 1/N$ . Then the $U_{\rho}$ -query complexity grows at least linearly with $N$ . Hence, our Esscher transform is most relevant for low-rank cases. Assume we have $r$ non-zero eigenvalues $\geq 1/\kappa$ . As a consequence $r\leq\kappa$ holds. While the condition number can still be exponential if the smallest eigenvalue is exponentially small, when the smallest eigenvalue is $1/{\rm poly}(r)$ , we obtain a well-behaved query complexity. In addition we can allow for smaller eigenvalues, especially when we are interested only in low-rank approximations of the Esscher transform. Let $1/\kappa_{\rm eff}\geq 1/\kappa$ , with the effective condition number $\kappa_{\rm eff}$ . With slight adaptations, our method can implement the Esscher transform on the effectively well-conditioned subspace, while leaving the other part undefined. This incurs an error compared to the full Esscher transform proportional to the importance of the neglected eigenvalues, but may be acceptable in many practical situations. Recall that low-rank approximations are frequently performed in statistics and machine learning.

If the desired output is a normalized state, one can apply techniques similar to those used for Gibbs sampling to extract the normalized Esscher-transformed state from the output of Algorithm 1. We briefly describe this procedure and the overhead cost it incurs. More details can be found in Chapter 3 of [Gil19]. Let $\varepsilon>0$ denote the desired precision in trace distance between our approximate output and the ideal state. First, we prepare a maximally entangled state on two registers. Next, we use Algorithm 1 to construct a $1$ -block-encoding $U$ of $e^{\frac{\sum_{i}\theta_{i}H_{i}+\log\rho}{2}}/\sqrt{\mathcal{N}}$ where $\mathcal{N}=e^{\|\theta\|_{1}+2(1+\log 2\kappa)}$ , with block-encoding error $0<\varepsilon_{1}<\varepsilon/N^{2}$ . We then apply $U$ to the second register to obtain a state $|\psi\rangle$ , so that tracing out the first register yields an approximate subnormalized state with trace distance error of $\mathcal{O}\left(\varepsilon/N\right)$ . That is,

\displaystyle\left\|\operatorname{Tr}_{1}(\bra{0}\otimes I)\ket{\psi}\bra{\psi}(\ket{0}\otimes I)-\frac{e^{\sum_{i}\theta_{i}H_{i}+\log\rho}}{N\mathcal{N}}\right\|_{T}=\mathcal{O}\left(\frac{\varepsilon}{N}\right).

With $\mathcal{Z}:=\operatorname{Tr}\left(e^{\sum_{i}\theta_{i}H_{i}+\log\rho}\right)$ , this state, when postselected after $\mathcal{O}\left(\sqrt{\frac{N\mathcal{N}}{\mathcal{Z}}}\log\frac{1}{% \varepsilon}\right)$ steps of fixed-point amplitude amplification (refer to Theorem 27 in [GSLW19]), results in a density operator $\varepsilon$ -close to the normalized Esscher-transformed state

\displaystyle\frac{e^{\sum_{i}\theta_{i}H_{i}+\log\rho}}{\operatorname{Tr}(e^{\sum_{i}\theta_{i}H_{i}+\log\rho})}

in trace distance. Taking this overhead cost into account and assuming $\varepsilon$ is sufficiently small (such that the block-encoding error satisfies $\varepsilon_{1}<2^{-\|\theta\|_{1}-2(1+\log 2\kappa)}$ ), the total query complexity of preparing the approximate Esscher-transformed state is

\displaystyle\widetilde{\mathcal{O}}\left(\kappa\log^{2}\left(\frac{N^{2}}{\varepsilon}\right)\right)\cdot\mathcal{O}\left(\sqrt{\frac{N\mathcal{N}}{\mathcal{Z}}}\log\frac{1}{\varepsilon}\right)\subseteq\widetilde{\mathcal{O}}\left(\kappa\sqrt{\frac{N\mathcal{N}}{\mathcal{Z}}}\log^{3}\left(\frac{1}{\varepsilon}\right)\right).
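The origin of the factor $\sqrt{N\mathcal{N}/\mathcal{Z}}$ can be seen in a small classical emulation. The sketch below is our own illustration: instead of a unitary block-encoding it applies the encoded (non-unitary) block $e^{H/2}/\sqrt{\mathcal{N}}$ directly, which corresponds to the branch in which the block-encoding ancillas are found in $\ket{0}$ . The matrix $H$ stands in for $\sum_{i}\theta_{i}H_{i}+\log\rho$ , and the value of `calN` is an arbitrary valid subnormalization, not the one from Algorithm 1.

```python
import numpy as np

def herm_fn(X, fn):
    """Apply a scalar function to a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(X)
    return (V * fn(w)) @ V.conj().T

rng = np.random.default_rng(4)
N = 4
G = rng.normal(size=(N, N))
H = (G + G.T) / 2                           # stand-in for sum_i theta_i H_i + log(rho)
calN = 4.0 * np.exp(np.linalg.norm(H, 2))   # any calN with ||e^H|| <= calN

half = herm_fn(H, lambda w: np.exp(w / 2.0)) / np.sqrt(calN)   # e^{H/2} / sqrt(calN)

# Maximally entangled state (1/sqrt(N)) sum_i |i>|i>, then act on register 2.
phi = np.eye(N).reshape(-1) / np.sqrt(N)
psi = np.kron(np.eye(N), half) @ phi

# Partial trace over register 1: the result should be e^H / (N * calN).
psi_mat = psi.reshape(N, N)                 # psi_mat[i, j]: amplitude of |i>_1 |j>_2
reduced = psi_mat.T @ psi_mat.conj()
Z = np.trace(herm_fn(H, np.exp)).real
p_success = np.trace(reduced).real          # equals Z / (N * calN)
amp_steps = np.sqrt(1.0 / p_success)        # scale of sqrt(N * calN / Z)
print(p_success, Z / (N * calN), amp_steps)
```

The weight of the postselected branch is $\mathcal{Z}/(N\mathcal{N})$ , so amplitude amplification needs on the order of the inverse square root of this quantity many rounds, matching the overhead stated above up to the logarithmic factor.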

5 Conclusion

In this paper, we considered a minimum relative entropy problem for the density operator subject to equality constraints. We formally solved this problem and the solution form inspired us to define the Quantum Esscher Transform (QUEST), a generalization of the classical Esscher transform to the quantum setting. We discussed its implementation on fault-tolerant quantum computers, leveraging techniques based on the QSVT framework. Given as inputs block-encodings of the initial quantum state and the constraint operators, the algorithm outputs an $\varepsilon$ -approximate block-encoding of the Esscher-transformed state with $U_{\rho}$ -query complexity

\displaystyle\mathcal{O}\left(\kappa\log\left(\frac{\log\kappa}{\varepsilon}\right)\log\left(\frac{1}{\varepsilon}\right)\right)\subseteq\widetilde{\mathcal{O}}\left(\kappa\log^{2}\left(\frac{1}{\varepsilon}\right)\right)

and $\{U_{j}:j\in[d]\}$ -query complexity

\mathcal{O}\left(d\log\frac{1}{\varepsilon}\right).

Several avenues remain open for future work:

• Is there a quantum algorithmic framework that can fully solve the minimum relative entropy problem? Our current approach only presents the formal solution for the optimal parameter $\lambda^{*}$. Approaches such as Newton’s algorithm with backtracking were suggested in [ZTF13], and quantized versions of them could be studied. Additionally, [AAKS20] demonstrated that $\lambda^{*}$ can, in principle, be found with a convex optimization program. Can we design a quantum algorithm to effectively address this problem?

• One could explore strategies for alternative input models. Our current work exclusively considered the purified access model, wherein the preparation of the purification of the input state was assumed. In contrast, the sampling access model, which assumes multiple independent copies of the input state, is another commonly used model. Gilyén and Poremba [GP22] have proposed an approach to implement approximate block-encodings of $\rho$, starting with sample access. This approach is based on a combination of density matrix exponentiation [LMR14, KLL ${}^{+}$ 17] and QSVT, and allows us to implement the quantum Esscher transform in the sampling access model. We leave the total cost of this procedure for further analysis.

• In Section 2.2.3, we noted potential connections between the quantum Esscher transform and imaginary-time evolution. To give these substance, further investigation is required.

• Various applications could be envisioned for the quantum Esscher transform. Its classical version has found usage for numerous problems in domains such as statistics, machine learning, and finance. These problems have quantum analogues, which could benefit from the quantum Esscher transform and its implementation on quantum computers.

Acknowledgments

The authors would like to thank Po-Wei Huang, Xiufan Li, Zhan Yu, Serge Massar and Roberto Rubboli for helpful discussions. This work is supported by the National Research Foundation, Singapore, and A*STAR under its CQT Bridging Grant and its Quantum Engineering Programme under grant NRF2021-QEP2-02-P05. KK acknowledges support from Leong Chuan Kwek, under project grant R-710-000-007-135.

Appendix A Proof of Theorem 2.2

Before delving into the proof, we introduce some notation and state a lemma to facilitate its presentation. The exponential family of $P$ with respect to the random variable $X$ is the set of measures

\displaystyle\Lambda=\left\{\frac{e^{\lambda\cdot X}P}{\mathbb{E}_{P}[e^{\lambda\cdot X}]}:\lambda\in\mathbb{R}^{d}\right\}.

Also, let

\displaystyle M=\{Q:\mathbb{E}_{Q}[X]=m\}.

Lemma A.1.

(Proposition 3.24 – [FS11]) Let $P$ be a probability measure on $(\Omega,\Sigma)$ and $X$ be a random variable on $\Omega$ . Fix $m\in\mathbbm{R}^{d}$ . Then for any probability measure $Q$ on $(\Omega,\Sigma)$ satisfying $\mathbb{E}_{Q}[X]=m$ , we have

\displaystyle D(Q\|P)\geq\sup_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\mathbb{E}_{P}[e^{\lambda\cdot X}]\right].

(A.1)

Moreover, the inequality is saturated if $Q=Q_{\lambda^{\prime}}:=e^{\lambda^{\prime}\cdot X}P/\mathbb{E}_{P}[e^{\lambda^{\prime}\cdot X}]\in\Lambda\cap M$ for some $\lambda^{\prime}\in\mathbb{R}^{d}$:

\displaystyle D(Q_{\lambda^{\prime}}\|P)=\lambda^{\prime}\cdot m-\log\mathbb{E}_{P}[e^{\lambda^{\prime}\cdot X}]=\sup_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\mathbb{E}_{P}[e^{\lambda\cdot X}]\right].

(A.2)

Proof.

Each $\lambda\in\mathbb{R}^{d}$ gives rise to a corresponding $Q_{\lambda}\in\Lambda$ (note that $Q_{\lambda}$ need not be in $M$). Then for any $Q\in M$, we have

\displaystyle\begin{aligned}
D(Q\|P)&=\sum_{\omega\in\Omega}Q(\omega)\log\left(\frac{Q(\omega)}{Q_{\lambda}(\omega)}\frac{Q_{\lambda}(\omega)}{P(\omega)}\right)\\
&=D(Q\|Q_{\lambda})+\sum_{\omega\in\Omega}Q(\omega)\log\frac{Q_{\lambda}(\omega)}{P(\omega)}\\
&\geq\sum_{\omega\in\Omega}Q(\omega)\log\frac{Q_{\lambda}(\omega)}{P(\omega)}\qquad(\text{by Jensen, }D(Q\|Q_{\lambda})\geq 0)\\
&=\sum_{\omega\in\Omega}Q(\omega)\log\frac{e^{\lambda\cdot X(\omega)}}{\mathbb{E}_{P}[e^{\lambda\cdot X}]}\\
&=\mathbb{E}_{Q}[\lambda\cdot X]-\log\mathbb{E}_{P}[e^{\lambda\cdot X}]\\
&=\lambda\cdot m-\log\mathbb{E}_{P}[e^{\lambda\cdot X}].
\end{aligned}

Since this holds for all $\lambda\in\mathbb{R}^{d}$, we conclude that $D(Q\|P)\geq\sup_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\mathbb{E}_{P}[e^{\lambda\cdot X}]\right]$. Furthermore, if $\lambda^{\prime}\in\mathbb{R}^{d}$ is such that $Q_{\lambda^{\prime}}\in\Lambda\cap M$, then setting $Q=Q_{\lambda^{\prime}}$ and repeating the argument above gives

\displaystyle\begin{aligned}
D(Q_{\lambda^{\prime}}\|P)&=\sum_{\omega\in\Omega}Q_{\lambda^{\prime}}(\omega)\log\frac{Q_{\lambda^{\prime}}(\omega)}{P(\omega)}\\
&=\sum_{\omega\in\Omega}Q_{\lambda^{\prime}}(\omega)\log\frac{e^{\lambda^{\prime}\cdot X(\omega)}}{\mathbb{E}_{P}[e^{\lambda^{\prime}\cdot X}]}\\
&=\mathbb{E}_{Q_{\lambda^{\prime}}}[\lambda^{\prime}\cdot X]-\log\mathbb{E}_{P}[e^{\lambda^{\prime}\cdot X}]\\
&=\lambda^{\prime}\cdot m-\log\mathbb{E}_{P}[e^{\lambda^{\prime}\cdot X}].
\end{aligned}

This value is at most the supremum over $\lambda$, while by the first part $D(Q_{\lambda^{\prime}}\|P)$ is at least that supremum, so the two coincide.

∎
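The bound (A.1) and its saturation (A.2) can be checked numerically on a small example. The sketch below assumes a hypothetical three-point sample space with a scalar random variable ($d=1$); the supremum over $\lambda$ is approximated by a grid search, which is an illustrative shortcut rather than part of the proof.

```python
import math

# Hypothetical toy data: reference measure P and scalar X on three points.
P = [0.5, 0.3, 0.2]
X = [0.0, 1.0, 2.0]

def tilt(lam):
    """Q_lambda = e^{lam X} P / E_P[e^{lam X}], a member of the exponential family."""
    w = [p * math.exp(lam * x) for p, x in zip(P, X)]
    z = sum(w)
    return [wi / z for wi in w]

def rel_entropy(Q, R):
    """Relative entropy D(Q || R) for strictly positive R."""
    return sum(q * math.log(q / r) for q, r in zip(Q, R) if q > 0)

def dual(lam, m):
    """The bracket in (A.1): lam * m - log E_P[e^{lam X}]."""
    return lam * m - math.log(sum(p * math.exp(lam * x) for p, x in zip(P, X)))

lam_prime = 0.7
Q_tilt = tilt(lam_prime)
m = sum(q * x for q, x in zip(Q_tilt, X))  # choose m so that Q_tilt lies in both Lambda and M

# A second element of M: the direction (1, -2, 1) changes neither the total mass nor the mean of X.
t = 0.05
Q_other = [Q_tilt[0] + t, Q_tilt[1] - 2 * t, Q_tilt[2] + t]

grid = [i / 100 for i in range(-300, 301)]
sup_dual = max(dual(lam, m) for lam in grid)  # grid approximation of the supremum
```

Here `rel_entropy(Q_tilt, P)` agrees with `dual(lam_prime, m)` and with `sup_dual`, while `rel_entropy(Q_other, P)` is strictly larger, as the lemma predicts.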

Proof of Theorem 2.2.

First, we have required $\min_{\omega\in\Omega}X_{i}(\omega)<m_{i}<\max_{\omega\in\Omega}X_{i}(\omega)$ because otherwise the constraints $\mathbb{E}_{Q}[X_{i}]=m_{i}$ cannot be satisfied. The Lagrangian function is

\displaystyle\mathcal{L}(Q,\lambda,\eta)=\sum_{\omega}Q(\omega)\log\frac{Q(\omega)}{P(\omega)}-\sum_{i=1}^{d}\lambda_{i}\left(\sum_{\omega}Q(\omega)X_{i}(\omega)-m_{i}\right)-\eta\left(\sum_{\omega}Q(\omega)-1\right).

Setting the first-order derivatives of $\mathcal{L}(Q,\lambda,\eta)$ with respect to $Q(\omega)$ to zero gives

\displaystyle Q^{\star}(\omega)=\frac{e^{\lambda^{\star}\cdot X(\omega)}P(\omega)}{\mathbb{E}_{P}[e^{\lambda^{\star}\cdot X}]},

where $\lambda^{\star}$ is to be determined from the $d$ constraints $\mathbb{E}_{Q}[X]=m$ :

\displaystyle\begin{aligned}
\mathbb{E}_{Q}[X]=m&\iff\frac{\mathbb{E}_{P}[Xe^{\lambda^{\star}\cdot X}]}{\mathbb{E}_{P}[e^{\lambda^{\star}\cdot X}]}-m=0\\
&\iff\frac{\mathbb{E}_{P}[(X-m)e^{\lambda^{\star}\cdot(X-m)}]}{\mathbb{E}_{P}[e^{\lambda^{\star}\cdot(X-m)}]}=0\\
&\iff\frac{\partial}{\partial\lambda}\log\mathbb{E}_{P}[e^{\lambda\cdot(X-m)}]\Big|_{\lambda=\lambda^{\star}}=0\\
&\iff\frac{\partial}{\partial\lambda}\mathbb{E}_{P}[e^{\lambda\cdot(X-m)}]\Big|_{\lambda=\lambda^{\star}}=0.
\end{aligned}

(A.4)

The last equivalence holds because $\log f(x)$ and $f(x)$ share the same minimum/maximum points, provided $f(x)>0$ at those points. It remains to show $Q^{\star}$ indeed minimizes $D(Q\|P)$, subject to the constraints $\mathbb{E}_{Q}[X]=m$. But this follows easily from Lemma A.1. Furthermore, since $x\mapsto x\log x$ is a strictly convex function, $D(Q\|P)$ is a strictly convex functional of $Q$ and so it can have at most one minimizer in the convex set $M$, thereby showing the uniqueness of $Q^{\star}$. Finally, again using Lemma A.1 we have $\lambda^{\star}=\operatorname*{argmax}_{\lambda\in\mathbb{R}^{d}}\left[\lambda\cdot m-\log\mathbb{E}_{P}[e^{\lambda\cdot X}]\right]=\operatorname*{argmin}_{\lambda\in\mathbb{R}^{d}}\left[\log\mathbb{E}_{P}[e^{\lambda\cdot(X-m)}]\right]=\operatorname*{argmin}_{\lambda\in\mathbb{R}^{d}}\mathbb{E}_{P}[e^{\lambda\cdot(X-m)}]$. ∎
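As an illustration of how $\lambda^{\star}$ is pinned down by the constraint, the sketch below solves $\mathbb{E}_{Q_{\lambda}}[X]=m$ for a scalar $X$ on a hypothetical three-point space. It uses plain bisection, chosen here only for simplicity (the tilted mean is increasing in $\lambda$ because its derivative is a variance); this is an assumption of the example and not a method taken from the text.

```python
import math

# Hypothetical toy data; m must lie strictly between min X and max X.
P = [0.5, 0.3, 0.2]
X = [0.0, 1.0, 2.0]
m = 1.2

def tilted_mean(lam):
    """E_{Q_lambda}[X] for Q_lambda proportional to e^{lam X} P."""
    w = [p * math.exp(lam * x) for p, x in zip(P, X)]
    return sum(wi * x for wi, x in zip(w, X)) / sum(w)

def solve_lambda(target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on the increasing map lam -> tilted_mean(lam)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if tilted_mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam_star = solve_lambda(m)
w = [p * math.exp(lam_star * x) for p, x in zip(P, X)]
Q_star = [wi / sum(w) for wi in w]

# Last line of (A.4): the derivative of E_P[e^{lam (X - m)}] vanishes at lam_star.
stationarity = sum(p * (x - m) * math.exp(lam_star * (x - m)) for p, x in zip(P, X))
```

Since $m$ exceeds the mean of $X$ under $P$ in this example, the resulting `lam_star` is positive, and `Q_star` reproduces the prescribed mean up to the bisection tolerance.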

Appendix B Wirtinger Calculus

The ‘Wirtinger Calculus’ provides a methodology for optimization problems involving complex matrices. It enables ‘differentiation as usual’ with respect to complex matrices. In this appendix, we state only the main definitions and results needed to solve Problem 2.4. For a more thorough exposition of this framework, we direct the reader to [KQKR23, Hjø11, KD09].

Consider functions of the form $f:\mathbb{C}^{n\times n}\longrightarrow\mathbb{C}$ . Since $\mathbb{C}$ is $\mathbb{R}^{2}$ endowed with the multiplication operation $(a,b)\times(c,d)\mapsto(ac-bd,ad+bc)$ , we can view

\displaystyle\begin{aligned}
f:\;&\mathbb{R}^{2(n\times n)}\longrightarrow\mathbb{R}^{2}\\
&(x_{ij},y_{ij})_{i,j\in[n]}=(\mathbf{X},\mathbf{Y})\mapsto(u(\mathbf{X},\mathbf{Y}),v(\mathbf{X},\mathbf{Y})).
\end{aligned}

For $i,j=1,\dots,n$, regard $z_{ij},z_{ij}^{*}$ as functions from $\mathbb{R}^{n\times n}\times\mathbb{R}^{n\times n}$ to $\mathbb{C}$, where $z_{ij}(\mathbf{X},\mathbf{Y})=x_{ij}+iy_{ij}$ and $z_{ij}^{*}(\mathbf{X},\mathbf{Y})=x_{ij}-iy_{ij}$. (Footnote 3: The notations $z,z^{*}$ may raise questions about independence. This is irrelevant: one may simply write $z_{1},z_{2}$ if one wishes. We emphasize that, for each $i,j$, the fundamental input variables are the two real numbers $x_{ij}$ and $y_{ij}$.) Then we have a function $\tilde{f}:\mathbb{C}^{n\times n}\times\mathbb{C}^{n\times n}\longrightarrow\mathbb{C}$ such that

\displaystyle f(\mathbf{X},\mathbf{Y}):=\underline{\tilde{f}\circ(\mathbf{Z},\mathbf{Z^{*}})}(\mathbf{X},\mathbf{Y})=\tilde{f}(\mathbf{Z}(\mathbf{X},\mathbf{Y}),\mathbf{Z^{*}}(\mathbf{X},\mathbf{Y}))=\tilde{f}(\mathbf{X+iY},\mathbf{X-iY}).

(B.1)

Partially differentiating $f$ with respect to each $x_{ij}$ and $y_{ij}$, and then rearranging terms, we have for $1\leq i,j\leq n$

\displaystyle\begin{aligned}
\frac{\partial\tilde{f}}{\partial z_{ij}}(\mathbf{Z}(\mathbf{X},\mathbf{Y}),\mathbf{Z^{*}}(\mathbf{X},\mathbf{Y}))&=\frac{1}{2}\left(\frac{\partial f}{\partial x_{ij}}-i\frac{\partial f}{\partial y_{ij}}\right)(\mathbf{X},\mathbf{Y})\\
\frac{\partial\tilde{f}}{\partial z_{ij}^{*}}(\mathbf{Z}(\mathbf{X},\mathbf{Y}),\mathbf{Z^{*}}(\mathbf{X},\mathbf{Y}))&=\frac{1}{2}\left(\frac{\partial f}{\partial x_{ij}}+i\frac{\partial f}{\partial y_{ij}}\right)(\mathbf{X},\mathbf{Y}).
\end{aligned}

(B.2)

To preserve the matrix structure of the parameters $z_{ij}$ and $z_{ij}^{*}$ we use the standard notation

\displaystyle\frac{\partial}{\partial\mathbf{Z}}:=\begin{bmatrix}\frac{\partial}{\partial z_{11}}&\dots&\frac{\partial}{\partial z_{1n}}\\ \vdots&\ddots&\vdots\\ \frac{\partial}{\partial z_{n1}}&\dots&\frac{\partial}{\partial z_{nn}}\end{bmatrix}\qquad\frac{\partial}{\partial\mathbf{Z^{*}}}:=\begin{bmatrix}\frac{\partial}{\partial z_{11}^{*}}&\dots&\frac{\partial}{\partial z_{1n}^{*}}\\ \vdots&\ddots&\vdots\\ \frac{\partial}{\partial z_{n1}^{*}}&\dots&\frac{\partial}{\partial z_{nn}^{*}}\end{bmatrix}

(B.9)

and similarly for $\frac{\partial}{\partial\mathbf{X}}$ and $\frac{\partial}{\partial\mathbf{Y}}$ . Then Equation B.2 is concisely stated as

\displaystyle\begin{aligned}
\frac{\partial\tilde{f}}{\partial\mathbf{Z}}(\mathbf{Z}(\mathbf{X},\mathbf{Y}),\mathbf{Z^{*}}(\mathbf{X},\mathbf{Y}))&=\frac{1}{2}\left(\frac{\partial f}{\partial\mathbf{X}}-i\frac{\partial f}{\partial\mathbf{Y}}\right)(\mathbf{X},\mathbf{Y})\\
\frac{\partial\tilde{f}}{\partial\mathbf{Z^{*}}}(\mathbf{Z}(\mathbf{X},\mathbf{Y}),\mathbf{Z^{*}}(\mathbf{X},\mathbf{Y}))&=\frac{1}{2}\left(\frac{\partial f}{\partial\mathbf{X}}+i\frac{\partial f}{\partial\mathbf{Y}}\right)(\mathbf{X},\mathbf{Y}).
\end{aligned}

(B.10)

$\frac{\partial}{\partial\mathbf{Z}}$ and $\frac{\partial}{\partial\mathbf{Z^{*}}}$ are the matrix Wirtinger derivatives of $f$ . Often, we abuse notation and write both $f(\mathbf{X},\mathbf{Y})$ and $f(\mathbf{Z},\mathbf{Z^{*}})$ , so we can write

\displaystyle\frac{\partial}{\partial\mathbf{Z}}=\frac{1}{2}\left(\frac{\partial}{\partial\mathbf{X}}-i\frac{\partial}{\partial\mathbf{Y}}\right),\qquad\frac{\partial}{\partial\mathbf{Z^{*}}}=\frac{1}{2}\left(\frac{\partial}{\partial\mathbf{X}}+i\frac{\partial}{\partial\mathbf{Y}}\right).

(B.11)
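In the scalar case ($n=1$), definition (B.11) can be checked with finite differences. The sketch below is a minimal illustration under that assumption; the helper name `wirtinger_pair` and the test point are hypothetical. For $f(z)=|z|^{2}$ one expects $\partial f/\partial z=z^{*}$ and $\partial f/\partial z^{*}=z$, and for the holomorphic $g(z)=z^{2}$ one expects $\partial g/\partial z=2z$ and $\partial g/\partial z^{*}=0$.

```python
def wirtinger_pair(f, z, h=1e-6):
    """Approximate (df/dz, df/dz*) via central differences in x = Re z and y = Im z."""
    x, y = z.real, z.imag
    dfdx = (f(complex(x + h, y)) - f(complex(x - h, y))) / (2 * h)
    dfdy = (f(complex(x, y + h)) - f(complex(x, y - h))) / (2 * h)
    # (B.11): d/dz = (d/dx - i d/dy) / 2 and d/dz* = (d/dx + i d/dy) / 2
    return 0.5 * (dfdx - 1j * dfdy), 0.5 * (dfdx + 1j * dfdy)

z0 = complex(0.8, -0.3)  # arbitrary test point

# Real-valued, non-holomorphic example: f(z) = |z|^2
d_z, d_zbar = wirtinger_pair(lambda z: abs(z) ** 2, z0)

# Holomorphic example: g(z) = z^2
g_z, g_zbar = wirtinger_pair(lambda z: z ** 2, z0)
```

The first example also illustrates Proposition B.1 below: for real-valued $f$ the two derivatives are complex conjugates of each other, so one vanishes exactly when the other does.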

The following three propositions are all we need in this paper. We omit their proofs, which can all be found in [KQKR23].

Proposition B.1.

Let $f:\mathbb{C}^{n\times n}\longrightarrow\mathbb{R}$ be a real-valued function of complex matrices. Then $f$ has a stationary point at $\mathbf{Z}=[z_{ij}]_{i,j\in[n]}$ if and only if

\displaystyle\frac{\partial f}{\partial\mathbf{Z}}(\mathbf{Z})=0\quad\left(\text{or equivalently}\;\;\frac{\partial f}{\partial\mathbf{Z^{*}}}(\mathbf{Z})=0\right).

Whether the solution of the above equation actually gives a minimum/maximum/saddle point has to be checked via additional considerations or by inspecting higher-order derivatives.

Proposition B.2.

Let $\mathbf{Z}$ be a complex, unstructured (see below) matrix and $F(z)=\sum_{n=0}^{\infty}c_{n}z^{n}$ be analytic. Define the scalar function $f(\mathbf{Z,Z^{*}}):=\operatorname{Tr}(F(\mathbf{Z}))$ . Then

\displaystyle\frac{\partial\operatorname{Tr}(F(\mathbf{Z}))}{\partial\mathbf{Z}}=F^{\prime}(\mathbf{Z})^{T}

where $F^{\prime}(\cdot)$ is the complex derivative of $F(\cdot)$ .
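Proposition B.2 can be sanity-checked numerically for a specific choice of $F$. The sketch below assumes $F(z)=z^{3}$ and a hypothetical unstructured $2\times 2$ complex matrix, and compares entrywise finite-difference Wirtinger derivatives of $\operatorname{Tr}(\mathbf{Z}^{3})$ against $F^{\prime}(\mathbf{Z})^{T}=3(\mathbf{Z}^{2})^{T}$. The helper names are illustrative only.

```python
def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def trace_cube(Z):
    """f(Z) = Tr(Z^3), i.e. Tr(F(Z)) with F(z) = z^3."""
    Z3 = matmul(matmul(Z, Z), Z)
    return sum(Z3[i][i] for i in range(len(Z)))

def wirtinger_grad(f, Z, h=1e-6):
    """Matrix of df/dz_ij, treating every entry of Z as an independent complex variable."""
    n = len(Z)
    G = [[0j] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(dz):
                W = [row[:] for row in Z]
                W[i][j] += dz
                return f(W)
            dfdx = (shifted(h) - shifted(-h)) / (2 * h)
            dfdy = (shifted(1j * h) - shifted(-1j * h)) / (2 * h)
            G[i][j] = 0.5 * (dfdx - 1j * dfdy)
    return G

Z = [[0.3 + 0.1j, -0.2 + 0.4j], [0.5 - 0.3j, 0.1 + 0.2j]]  # hypothetical unstructured matrix
numeric = wirtinger_grad(trace_cube, Z)
Z2 = matmul(Z, Z)
analytic = [[3 * Z2[j][i] for j in range(2)] for i in range(2)]  # F'(Z)^T with F'(z) = 3 z^2
```

Because the chosen matrix is not symmetric, the comparison is sensitive to the transpose in the proposition.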

So far, by writing $f:\mathbb{C}^{n\times n}\longrightarrow\mathbb{C}$ we have implicitly assumed the input matrices have independent components (we call such matrices ‘unstructured’). This condition often does not hold, e.g. when our matrices of interest are symmetric/Hermitian etc. To obtain the correct Wirtinger derivatives with respect to structured matrices, we resort to the chain rule.

Proposition B.3 (Wirtinger derivatives with respect to Hermitian matrices).

Let $f(\mathbf{Z,Z^{*}})$ be a function of complex Hermitian matrices. Then the Wirtinger derivatives of $f$ with respect to $\mathbf{Z,Z^{*}}$ are given by

\displaystyle\frac{\partial f}{\partial\mathbf{Z}}=\left[\frac{\partial f}{\partial\mathbf{\tilde{Z}}}+\left(\frac{\partial f}{\partial\mathbf{\tilde{Z}^{*}}}\right)^{T}\right]_{\mathbf{\tilde{Z}}=\mathbf{Z}}\qquad\text{and}\qquad\frac{\partial f}{\partial\mathbf{Z^{*}}}=\left[\frac{\partial f}{\partial\mathbf{\tilde{Z}^{*}}}+\left(\frac{\partial f}{\partial\mathbf{\tilde{Z}}}\right)^{T}\right]_{\mathbf{\tilde{Z}}=\mathbf{Z}}.

Here, the tildes above $\mathbf{\tilde{Z},\tilde{Z}^{*}}$ indicate that they are unstructured matrices. Thus, to derive the Wirtinger derivatives with respect to Hermitian matrices, first obtain the Wirtinger derivative of $f$ , assuming the inputs are unstructured. Then form the correct expressions given above and reinstate the structured matrices $\mathbf{Z,Z^{*}}$ as the arguments.

References

[AAKS20] Anurag Anshu, Srinivasan Arunachalam, Tomotaka Kuwahara, and Mehdi Soleimanifar. Sample-efficient learning of quantum many-body systems. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 685–691. IEEE, 2020.
[BCC ${}^{+}$ 15] Dominic W Berry, Andrew M Childs, Richard Cleve, Robin Kothari, and Rolando D Somma. Simulating Hamiltonian dynamics with a truncated Taylor series. Physical review letters, 114(9):090502, 2015.
[BK91] Michael Berman and Ronnie Kosloff. Time-dependent solution of the Liouville-von Neumann equation: Non-dissipative evolution. Computer physics communications, 63(1-3):1–20, 1991.
[Bra96] Samuel L Braunstein. Geometry of quantum inference. Physics Letters A, 219(3-4):169–174, 1996.
[BSS23] Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. On tilted losses in machine learning: Theory and applications. Journal of Machine Learning Research, 24:1–79, 2023.
[CGJ18] Shantanav Chakraborty, András Gilyén, and Stacey Jeffery. The power of block-encoded matrix powers: improved regression techniques via faster Hamiltonian simulation. arXiv preprint arXiv:1804.01973, 2018.
[DMB ${}^{+}$ 23] Alexander M Dalzell, Sam McArdle, Mario Berta, Przemyslaw Bienias, Chi-Fang Chen, András Gilyén, Connor T Hann, Michael J Kastoryano, Emil T Khabiboulline, Aleksander Kubica, et al. Quantum algorithms: A survey of applications and end-to-end complexities. arXiv preprint arXiv:2310.03011, 2023.
[Esc32] F Esscher. On the probability function in the collective theory of risk. Skand. Aktuarie Tidskr., 15:175–195, 1932.
[FS11] Hans Föllmer and Alexander Schied. Stochastic finance: an introduction in discrete time. Walter de Gruyter, 2011.
[Gil19] András Gilyén. Quantum singular value transformation & its algorithmic applications. PhD thesis, University of Amsterdam, 2019.
[GL19] András Gilyén and Tongyang Li. Distributional property testing in a quantum world. arXiv preprint arXiv:1902.00814, 2019.
[GLM08] Vittorio Giovannetti, Seth Lloyd, and Lorenzo Maccone. Quantum random access memory. Physical review letters, 100(16):160501, 2008.
[GP22] András Gilyén and Alexander Poremba. Improved quantum algorithms for fidelity estimation. arXiv preprint arXiv:2203.15993, 2022.
[GS ${}^{+}$ 93] Hans U Gerber, Elias SW Shiu, et al. Option pricing by Esscher transforms. HEC Ecole des hautes études commerciales, 1993.
[GSLW19] András Gilyén, Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 193–204, 2019.
[Hjø11] Are Hjørungnes. Complex-valued matrix derivatives: with applications in signal processing and communications. Cambridge University Press, 2011.
[HS06] Friedrich Hubalek and Carlo Sgarra. Esscher transforms and the minimal entropy martingale measure for exponential Lévy models. Quantitative finance, 6(02):125–145, 2006.
[Jay57] Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.
[KD09] Ken Kreutz-Delgado. The complex gradient operator and the CR-calculus. arXiv preprint arXiv:0906.4835, 2009.
[KLL ${}^{+}$ 17] Shelby Kimmel, Cedric Yen-Yu Lin, Guang Hao Low, Maris Ozols, and Theodore J Yoder. Hamiltonian simulation with optimal sample complexity. npj Quantum Information, 3(1):13, 2017.
[KQKR23] Kelvin Koor, Yixian Qiu, Leong Chuan Kwek, and Patrick Rebentrost. A short tutorial on Wirtinger Calculus with applications in quantum information. arXiv preprint arXiv:2312.04858, 2023.
[LC17] Guang Hao Low and Isaac L Chuang. Optimal Hamiltonian simulation by quantum signal processing. Physical review letters, 118(1):010501, 2017.
[LC19] Guang Hao Low and Isaac L Chuang. Hamiltonian simulation by qubitization. Quantum, 3:163, 2019.
[LMR14] Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10(9):631–633, 2014.
[LYC16] Guang Hao Low, Theodore J Yoder, and Isaac L Chuang. Methodology of resonant equiangular composite quantum gates. Physical Review X, 6(4):041067, 2016.
[MJE ${}^{+}$ 19] Sam McArdle, Tyson Jones, Suguru Endo, Ying Li, Simon C Benjamin, and Xiao Yuan. Variational ansatz-based quantum simulation of imaginary time evolution. npj Quantum Information, 5(1):75, 2019.
[MRTC21] John M Martyn, Zane M Rossi, Andrew K Tan, and Isaac L Chuang. Grand unification of quantum algorithms. PRX Quantum, 2(4):040203, 2021.
[MST ${}^{+}$ 20] Mario Motta, Chong Sun, Adrian TK Tan, Matthew J O’Rourke, Erika Ye, Austin J Minnich, Fernando GSL Brandao, and Garnet Kin-Lic Chan. Determining eigenstates and thermal states on a quantum computer using quantum imaginary time evolution. Nature Physics, 16(2):205–210, 2020.
[OP07] Stefano Olivares and Matteo GA Paris. Quantum estimation via the minimum Kullback entropy principle. Physical Review A, 76(4):042120, 2007.
[RF23] Patrick Rall and Bryce Fuller. Amplitude estimation from quantum signal processing. Quantum, 7:937, 2023.
[Sie76] David Siegmund. Importance sampling in the Monte Carlo study of sequential tests. The Annals of Statistics, pages 673–684, 1976.
[SJ80] John Shore and Rodney Johnson. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on information theory, 26(1):26–37, 1980.
[vAG18] Joran van Apeldoorn and András Gilyén. Improvements in quantum SDP-solving with applications. arXiv preprint arXiv:1804.05058, 2018.
[Wil13] Mark M Wilde. Quantum information theory. Cambridge university press, 2013.
[ZTF13] Mattia Zorzi, Francesco Ticozzi, and Augusto Ferrante. Minimum relative entropy for quantum estimation: Feasibility and general solution. IEEE transactions on information theory, 60(1):357–367, 2013.

\displaystyle\begin{aligned}
\left\|A-\beta(\bra{0^{b+a}}\otimes I_{n})\widetilde{W}(\ket{0^{b+a}}\otimes I_{n})\right\|&=\left\|A-\sum_{j=0}^{m-1}(\beta c_{j}^{*}d_{j}-\alpha_{j}y_{j}+\alpha_{j}y_{j})\cdot(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\sum_{j=0}^{m-1}|\beta c_{j}^{*}d_{j}-\alpha_{j}y_{j}|+\left\|A-\sum_{j=0}^{m-1}\alpha_{j}y_{j}(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\varepsilon_{\text{SP}}+\left\|\sum_{j=0}^{m-1}y_{j}A_{j}-\sum_{j=0}^{m-1}y_{j}\alpha_{j}(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\varepsilon_{\text{SP}}+\sum_{j=0}^{m-1}|y_{j}|\left\|A_{j}-\alpha_{j}(\bra{0^{a}}\otimes I_{n})U_{j}(\ket{0^{a}}\otimes I_{n})\right\|\\
&\leq\varepsilon_{\text{SP}}+\sum_{j=0}^{m-1}|y_{j}|\varepsilon_{\text{BE}}\\
&\leq\varepsilon_{\text{SP}}+\frac{\beta}{\inf_{j}\alpha_{j}}\varepsilon_{\text{BE}}.
\end{aligned}
