Rate-Distortion-Perception Tradeoff for Gaussian Vector Sources

Jingjing Qian, Sadaf Salehkalaibar, Jun Chen, Ashish Khisti, Wei Yu, Wuxian Shi, Yiqun Ge and Wen Tong

Jingjing Qian and Jun Chen are with the Department of Electrical and Computer Engineering at McMaster University, Hamilton, ON L8S 4K1, Canada (email: {qianj40, chenjun}@mcmaster.ca). Sadaf Salehkalaibar, Ashish Khisti and Wei Yu are with the Department of Electrical and Computer Engineering at the University of Toronto, Toronto, M5S 3G4, Canada (email: {sadafs, akhisti, weiyu}@ece.utoronto.ca). Wuxian Shi, Yiqun Ge and Wen Tong are with the Ottawa Research Center, Huawei Technologies, Ottawa, ON K2K 3J1, Canada (email: {wuxian.shi, yiqun.ge, tongwen}@huawei.com).

Abstract

This paper studies the rate-distortion-perception (RDP) tradeoff for a Gaussian vector source coding problem where the goal is to compress the multi-component source subject to distortion and perception constraints. The purpose of imposing a perception constraint is to ensure visually pleasing reconstructions. We consider this RDP setting with either the Kullback-Leibler (KL) divergence or the Wasserstein-2 metric as the perception loss function, and show that for Gaussian vector sources, jointly Gaussian reconstructions are optimal. We further demonstrate that the optimal tradeoff can be expressed as an optimization problem, which can be explicitly solved. An interesting property of the optimal solution is as follows. Without the perception constraint, the traditional reverse water-filling solution for characterizing the rate-distortion (RD) tradeoff of a Gaussian vector source states that the optimal rate allocated to each component depends on a constant, called the water-level. If the variance of a specific component is below the water-level, it is assigned a zero compression rate. However, with active distortion and perception constraints, we show that the optimal rates allocated to the different components are always positive. Moreover, the water-levels that determine the optimal rate allocation for different components are unequal. We further treat the special case of perceptually perfect reconstruction and study its RDP function in the high-distortion and low-distortion regimes to obtain insight into the structure of the optimal solution.

Index Terms:

Rate-distortion-perception function, lossy source coding, lossy compression, Gaussian vector sources, reverse water-filling

I Introduction

The rate-distortion-perception (RDP) function is a generalization of Shannon’s rate-distortion function that incorporates an additional perception loss function which measures the distance between the distributions of the source and the reconstruction. It has been observed that in the neural compression framework [1, 2, 3, 4], improving realism in the reconstruction comes at the price of increased distortion. In this framework, realism is controlled by a perception loss function between the distributions of the source and the reconstruction, while distortion is controlled via a standard distortion loss function on the samples of the source and its reconstruction, e.g., in terms of mean squared error. The RDP function introduced in Blau and Michaeli [5] formalizes this tradeoff.

The extension of classical rate-distortion (RD) theory to incorporate constraints on the distribution of the reconstruction samples has been studied in various works in the information theory literature; see e.g., [6] and references therein. More recently, Theis and Wagner [7] present a one-shot coding theorem by means of the strong functional representation lemma (SFRL) [8] to establish the operational validity of the RDP function [5]. In [9], the authors establish analytic properties of the RDP function for the special case of (scalar) Gaussian sources, with a quadratic distortion function and a perception loss function of either Kullback–Leibler (KL) divergence or Wasserstein-2 distance between the source and the reconstruction distributions. The role of common randomness in the study of RDP function has been studied in [10, 11]. Furthermore, the distortion-perception tradeoff with a squared error distortion and Wasserstein-2 perception loss, but without an explicit compression rate constraint, has been studied in [12, 13], where it is shown that the entire tradeoff curve can be achieved by interpolating the two extremal reconstructions based on a given representation. Other related works include [14, 15].

This paper studies the RDP function of a Gaussian vector source under a squared error distortion and either KL divergence or Wasserstein-2 distance as the perception loss metric. Our result is thus an extension of prior work [9] on scalar Gaussian sources to the case of vector sources. We start by demonstrating the optimality of jointly Gaussian reconstructions for Gaussian vector sources in the RDP setting. We then show that by decomposing the Gaussian vector source using the unitary transformation obtained from the eigenvalue decomposition of its covariance matrix, it is possible to derive an achievable RDP function of the Gaussian vector source in terms of the RDP functions of its constituent scalar components. The optimality of this achievable scheme can be established by a converse proof. This means that the characterization of the optimal RDP function can be formulated as an optimization problem. We explicitly derive the solution of the optimization problem and investigate structural properties of the optimal solution.

The optimal RDP function for the Gaussian vector source has the following interesting property. Without the perception constraint, the rate-distortion function of a parallel Gaussian source model has a classical reverse water-filling characterization [16, Thm 10.3], where the optimal rate allocation across the components is computed according to a distortion-dependent parameter called the water-level. A positive rate is assigned to those components that have a variance above this parameter. Any component whose variance is below the water-level has a zero rate; see Fig. 1(a). However, with a perception constraint, we observe a qualitatively different solution as shown in Fig. 1(b). First, unlike the case of reverse water-filling, the associated water-level for each component can be different and is characterized as a solution to a set of equations. Second, while reverse water-filling assigns zero rate to those source components whose variances are below the water-level, all components in the RDP setting are assigned a non-zero rate as long as both the distortion and perception constraints are active.

Figure 1: (a) Without a perception constraint, the traditional reverse water-filling solution for a parallel Gaussian source fixes a constant water-level. When the variance of a specific component is less than the water-level, it is assigned zero rate. (b) With an active perception constraint, unequal water-levels are assigned to different components. The variance of each component is always greater than the corresponding water-level. Every component has a positive rate.

We further consider the special case of zero perception loss (so the source and reconstruction distributions are identical) and establish analytical results in this case. Moreover, we present asymptotic results for the high-distortion and low-distortion regimes with zero perception loss, which offer additional insight into the difference between the RDP function and the RD function.

The rest of the paper is organized as follows. In Section II, we introduce the system model and some preliminaries. Some basics on the traditional reverse water-filling solution are provided in Section III. We discuss the generalized water-filling solution in Section IV for both KL-divergence and Wasserstein-2 distance as perception metrics; some properties of the RDP function are also discussed for perfect perceptual reconstruction; the asymptotic analysis is provided for both low and high distortion regimes.

Notation: We denote entropy, differential entropy and mutual information by $H(.)$ , $h(.)$ and $I(.;.)$ , respectively. The cardinality of the set $\mathcal{X}$ is written as $|\mathcal{X}|$ . We use $P_{X}$ to denote the probability distribution function of a random vector $X$ . We use $\mathcal{N}(\mu,\Sigma)$ to denote the Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$ . We use $\mathbb{E}[\cdot]$ to denote the expectation operator, and $\mathbb{R}$ to denote the set of real numbers. Throughout this paper, the base of the logarithm function is $e$ .

II System Model and Preliminaries

Let $X\sim P_{X}$ be an $L$ -dimensional Gaussian vector source with mean $0$ and covariance matrix $\Sigma_{X}\succ 0$ . Consider the eigenvalue decomposition of $\Sigma_{X}$ as follows:

\displaystyle\Sigma_{X}=\Theta^{T}\Lambda_{X}\Theta,

(1)

where $\Theta$ is unitary and $\Lambda_{X}$ is a diagonal matrix of positive eigenvalues,¹

\displaystyle\Lambda_{X}=\text{diag}^{L}(\lambda_{1},\ldots,\lambda_{L}).

(2)

¹ If some of the eigenvalues are zero, the corresponding columns of the unitary matrix $\Theta$ can be removed, and we have a diagonal $\Lambda_{X}$ of lower dimension. The rest of the derivation proceeds in the same way.

We assume that there is unlimited common randomness $K\in\mathcal{K}$ shared between the encoder and the decoder. Consider the following one-shot encoding and decoding functions where the source samples are encoded one at a time:

\displaystyle f\colon\mathbbm{R}^{L}\times\mathcal{K}\to\mathcal{M},

(3)

\displaystyle g\colon\mathcal{M}\times\mathcal{K}\to\mathbbm{R}^{L}.

(4)

Here, $\mathcal{M}$ denotes the set of messages. Let $P_{\hat{X}}$ be the distribution of the reconstruction induced by the encoding and decoding mechanisms. In this paper, we measure distortion using a squared-error loss function $d\colon\mathbbm{R}^{L}\times\mathbbm{R}^{L}\to\mathbbm{R}_{\geq 0}$ where $d(x,\hat{x}):=\|x-\hat{x}\|^{2}$ . From a perceptual perspective, for given probability distributions $P_{X}$ and $P_{\hat{X}}$ , we use $\phi(P_{X},P_{\hat{X}})$ to denote the perception loss function capturing the difference between the two distributions. For the two perception metrics that we consider in the following discussion, we have $\phi(P_{X},P_{\hat{X}})=0$ if and only if $P_{X}=P_{\hat{X}}$ .

The above framework is referred to as the one-shot setting, because it compresses one sample at a time. We can also define the setting of encoding $n$ independently and identically distributed (i.i.d.) samples $X^{n}=(X_{1},\ldots,X_{n})$ and reconstructing $\hat{X}^{n}=(\hat{X}_{1},\ldots,\hat{X}_{n})$ , and consider the asymptotic setting with $n\to\infty$ .

Definition 1 (Operational RDP Functions)

Let $X\sim P_{X}$ . For given distortion-perception constraints $(D,P)$ , a rate $R$ is said to be achievable if there exist encoding and decoding functions satisfying

\displaystyle\mathbbm{E}[\ell(M)]\leq R,

(5)

\displaystyle\mathbbm{E}[\|X-\hat{X}\|^{2}]\leq D,

(6)

\displaystyle\phi(P_{X},P_{\hat{X}})\leq P,

(7)

where $\ell(M)$ denotes the length of the message $M$ for encoding one sample. The infimum of all achievable rates $R$ is called the one-shot rate-distortion-perception (RDP) function, denoted as $R^{o}(D,P)$ .

For the asymptotic setting, given distortion-perception constraints $(D,P)$ , a rate $R$ is said to be achievable if there exist encoding and decoding functions such that

\displaystyle\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\mathbbm{E}[\|X_{i}-\hat{X}_{i}\|^{2}]\leq D,

(8)

\displaystyle\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\phi(P_{X_{i}},P_{\hat{X}_{i}})\leq P,

(9)

with the message $M$ that encodes $X^{n}$ satisfying

\displaystyle\lim_{n\to\infty}\frac{1}{n}\mathbbm{E}[\ell(M)]\leq R.

(10)

The infimum of all achievable rates is called the asymptotic RDP function, denoted as $R^{\infty}(D,P)$ .

Definition 2 (Information RDP Function)

For given $X\sim P_{X}$ , let $\mathcal{P}_{\hat{X}|X}(D,P)$ be the set of conditional distributions $P_{\hat{X}|X}$ such that for a fixed $(D,P)$ , we have

\mathbbm{E}[\|X-\hat{X}\|^{2}]\leq D,\qquad\phi(P_{X},P_{\hat{X}})\leq P.

(11)

The information rate-distortion-perception (RDP) function is defined as

R(D,P)=\inf_{P_{\hat{X}|X}\in\mathcal{P}_{\hat{X}|X}(D,P)}I(X;\hat{X}).

(12)

As explained in detail later, using the SFRL as in [8] and following similar steps to Theorem 2 and Theorem 5 in Appendix A.2 of [9], one can show that

R(D,P)\leq R^{o}(D,P)\leq R(D,P)+\log(R(D,P)+1)+5

(13)

and

R^{\infty}(D,P)=R(D,P).

(14)

Consequently, the one-shot operational RDP function $R^{o}(D,P)$ exceeds the information RDP function $R(D,P)$, and hence the asymptotic RDP function $R^{\infty}(D,P)$, by at most $\log(R(D,P)+1)+5$. This overhead becomes negligible relative to $R(D,P)$ at high rate.
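To give a feel for the size of the one-shot overhead in (13), the following minimal sketch evaluates the upper bound $R+\log(R+1)+5$ for a few rates in nats. The function name `one_shot_upper_bound` and the sample rates are illustrative choices and do not come from the paper.

```python
import math


def one_shot_upper_bound(rate_nats: float) -> float:
    """Right-hand side of (13): R + log(R + 1) + 5, with R in nats."""
    return rate_nats + math.log(rate_nats + 1.0) + 5.0


# Illustrative rates only; the additive gap grows logarithmically,
# so the ratio of the upper bound to R tends to one as R grows.
for r in (1.0, 10.0, 100.0, 1000.0):
    ub = one_shot_upper_bound(r)
    print(f"R = {r:7.1f}  upper bound = {ub:9.3f}  ratio = {ub / r:.4f}")
```

The additive gap is unbounded but grows only logarithmically in $R(D,P)$, which is the sense in which the one-shot and information RDP functions are close at high rate.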

In the rest of the paper, the perception metric $\phi(P_{X},P_{\hat{X}})$ is assumed to be either the KL-divergence, i.e.,

\displaystyle D(P_{\hat{X}}\|P_{X})=\int_{x}P_{\hat{X}}(x)\log\frac{P_{\hat{X}}(x)}{P_{X}(x)}dx,

(15)

or the (squared) Wasserstein-2 distance, i.e.,

\displaystyle W_{2}^{2}(P_{X},P_{\hat{X}})=\inf\mathbbm{E}[\|X-\hat{X}\|^{2}],

(16)

where the infimum is taken over all joint distributions of $(X,\hat{X})$ with marginals $P_{X}$ and $P_{\hat{X}}$ .

Before characterizing the RDP function, we first review the case of no perception constraint, which corresponds to traditional reverse water-filling for the classical rate-distortion function.

III Traditional Reverse Water-Filling

The classical rate-distortion theory for a parallel Gaussian source states that the optimal rate allocated to each component depends on a constant parameter, called the water-level, as shown in Fig. 1(a). The water-level also represents the distortion allowed at those components whose variances are above the water-level. For a given distortion $D$ , let $\nu(D)$ be the solution to the equation

\displaystyle\sum_{\ell=1}^{L}\left[\lambda_{\ell}-\nu(D)\right]^{+}=\left[\sum_{\ell=1}^{L}\lambda_{\ell}-D\right]^{+},

(17)

where $[x]^{+}:=\max\{0,x\}$ . Now, let

\displaystyle\gamma_{\ell}^{*}(D,\infty)=\left\{\begin{array}{ll}\lambda_{\ell}&\;\;\text{if}\;\;\nu(D)\geq\lambda_{\ell},\\ \nu(D)&\;\;\text{if}\;\;\nu(D)<\lambda_{\ell}.\end{array}\right.

(20)

The rate-distortion function for the Gaussian vector source with variance $\lambda_{\ell}$ for its $\ell$ -th component, $\ell\in\{1,\ldots,L\}$ , is as follows.

Theorem 1 (Thm 10.3 in [16])

For a Gaussian vector source, we have

\displaystyle R(D,\infty)=\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma^{*}_{\ell}(D,\infty)}.

(21)

To simplify notation, we can redefine the water-level as $\gamma_{\ell}^{*}(D,\infty)$ in order to account for the components whose variances are below the water-level. If $\lambda_{\ell}$ is below $\nu(D)$ for some $\ell$ , then we set $\gamma_{\ell}^{*}(D,\infty)=\lambda_{\ell}$ and assign zero rate to this component. Two special cases of the above theorem are of special interest.

Proposition 1 (High-Distortion Compression)

In the high-distortion regime, we have that for sufficiently small $\epsilon>0$

\displaystyle R\left(\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,\infty\right)=\frac{\epsilon}{2\lambda^{\max}}+O(\epsilon^{2}),

(22)

where $\lambda^{\max}=\max_{\ell}\lambda_{\ell}$ . Let $L^{\max}$ denote the set of indices whose eigenvalues are equal to $\lambda^{\max}$ . Then, the water-levels are given by


\displaystyle\gamma_{\ell}^{*}\left(\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,\infty\right)=\lambda_{\ell},\qquad\forall\ell\in\{1,\ldots,L\}\backslash L^{\max},

(23a)

\displaystyle\gamma^{*}_{\ell^{\max}}\left(\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,\infty\right)=\lambda^{\max}-\frac{\epsilon}{|L^{\max}|},\;\;\forall\ell^{\max}\in L^{\max}.

(23b)

Proof:

See Appendix A-1. ∎

The above proposition states that in the high-distortion regime, a positive rate is assigned only to the components with the largest eigenvalue.

Proposition 2 (Low-Distortion Compression)

In the low-distortion regime, we have that for a sufficiently small $\epsilon>0$

\displaystyle R(\epsilon,\infty)=\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{L\lambda_{\ell}}{\epsilon},

(24)

where the water-levels are given by

\displaystyle\gamma_{\ell}^{*}(\epsilon,\infty)=\frac{\epsilon}{L},\qquad\forall\ell\in\{1,\ldots,L\}.

(25)

Proof:

See Appendix A-2. ∎

For low-distortion compression, according to the above proposition, the same water-level is assigned to all components.
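The reverse water-filling recipe of this section can be sketched numerically as follows. This is a minimal illustration, assuming $0<D\leq\sum_{\ell}\lambda_{\ell}$: it solves (17) for $\nu(D)$ by bisection, forms the water-levels, and evaluates (21). The function names, the bisection tolerance, and the example eigenvalues are our own choices.

```python
import math


def water_level(eigs, D, tol=1e-12):
    """Solve (17) for nu(D) by bisection; assumes 0 < D <= sum(eigs)."""
    target = max(sum(eigs) - D, 0.0)
    lo, hi = 0.0, max(eigs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The left-hand side of (17) is non-increasing in the water-level.
        if sum(max(lam - mid, 0.0) for lam in eigs) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rd_function(eigs, D):
    """Return R(D, infinity) in nats, as in (21), and the per-component water-levels."""
    nu = water_level(eigs, D)
    gammas = [min(nu, lam) for lam in eigs]
    rate = 0.5 * sum(math.log(lam / g) for lam, g in zip(eigs, gammas))
    return rate, gammas


eigs = [2.0, 1.0, 0.25]  # hypothetical eigenvalues

# High-distortion regime (Proposition 1): only the largest component gets rate.
eps = 0.01
rate_hi, gammas_hi = rd_function(eigs, sum(eigs) - eps)
print(rate_hi, eps / (2 * max(eigs)), gammas_hi)

# Low-distortion regime (Proposition 2): common water-level eps / L.
eps = 0.03
rate_lo, gammas_lo = rd_function(eigs, eps)
print(rate_lo, gammas_lo)
```

In the first call the two smaller components keep $\gamma_{\ell}^{*}=\lambda_{\ell}$ and thus receive zero rate, while in the second call all three water-levels coincide.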

IV Rate-Distortion-Perception Function

IV-A Optimality of Gaussian Reconstruction

We first present a result indicating that for the two perception metrics (15) and (16) considered in this paper and for a Gaussian vector source, jointly Gaussian reconstruction is optimal.

Theorem 2

For a zero-mean Gaussian source $X$ , if the perception metric is either the KL-divergence or the Wasserstein-2 distance, without loss of optimality, in the optimization problem (12), we can restrict the reconstruction $\hat{X}$ to have mean zero and be jointly Gaussian with $X$ .

Proof:

See Appendix B. ∎

A common property of the two perception metrics that enables the above theorem to hold is that if the source is Gaussian distributed, conditional Gaussian reconstruction minimizes both metrics among those with the same first- and second-order joint statistics. Theorem 2 implies that the optimization of the RDP function can be restricted to jointly Gaussian distributions that satisfy the distortion and perception constraints.

IV-B RDP Function with KL Divergence as Perception Metric

In this section, we present the RDP function with the KL-divergence as the perception metric, i.e., $\phi(P_{X},P_{\hat{X}})=D(P_{\hat{X}}\|P_{X})$ . The results for the Wasserstein-2 distance as the perception metric are stated in the subsequent section. We present both one-shot and asymptotic RDP functions. As already mentioned, the one-shot RDP function $R^{o}(D,P)$ is close to the information RDP function $R(D,P)$ at high rate. Here we provide explicit constructions of both one-shot and asymptotic coding strategies for achieving (close to) $R(D,P)$ .

The first step is to decompose the source using eigenvalue decomposition as in (1) and define

\displaystyle Z=\Theta X.

(26)

The main idea is to construct a new Gaussian random vector $\hat{Z}$ and to use the channel simulation result of [8] to communicate $\hat{Z}$ to the decoder at a rate of $R$ . The new random vector $\hat{Z}$ is designed to be correlated with $Z$ in a very specific way in order to satisfy the distortion and perception constraints $D$ and $P$ , respectively. The correlation between $Z$ and $\hat{Z}$ is controlled by two sets of parameters, $\{\gamma_{\ell}\}_{\ell=1}^{L}$ and $\{\hat{\lambda}_{\ell}\}_{\ell=1}^{L}$ , such that $0<\gamma_{\ell}\leq\lambda_{\ell}$ and $0<\hat{\lambda}_{\ell}\leq\lambda_{\ell}$ . The optimal values of these parameters will be determined later.

In effect, instead of the classical rate-distortion setting where $\hat{Z}$ is chosen to minimize the rate subject to the distortion constraint, here we choose $\hat{Z}$ to satisfy both distortion and perception constraints. We construct this noisy version of $Z$ at the decoder by taking advantage of the availability of common randomness.

Specifically, $\hat{Z}$ is a zero-mean random vector, jointly Gaussian with $Z$ , such that the pairs $(Z_{\ell},\hat{Z}_{\ell})$ , $\ell\in\{1,\ldots,L\}$ , are mutually independent and

\mathrm{cov}(Z_{\ell},\hat{Z}_{\ell})=\left[\begin{array}{cc}\lambda_{\ell}&\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}\\ \sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}&\hat{\lambda}_{\ell}\end{array}\right].

(27)

With the above covariance structure, we can verify that $\gamma_{\ell}$ is the minimum mean-squared error (MMSE) of estimating $Z_{\ell}$ based on $\hat{Z}_{\ell}$ , i.e.,

\displaystyle\gamma_{\ell}=\mathbbm{E}[(Z_{\ell}-\mathbbm{E}[Z_{\ell}|\hat{Z}_{\ell}])^{2}].

(28)
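The claim in (28) can be checked directly from the covariance structure (27): for jointly Gaussian scalars, the MMSE equals the source variance minus the squared cross-covariance divided by the variance of the observation. The sketch below is a minimal numerical check with hypothetical values of $\lambda_{\ell}$, $\hat{\lambda}_{\ell}$ and $\gamma_{\ell}$; the function name is our own.

```python
import math


def mmse_from_covariance(lam: float, lam_hat: float, gamma: float) -> float:
    """MMSE of Z given Z_hat under (27): var(Z) - cov(Z, Z_hat)^2 / var(Z_hat)."""
    cross = math.sqrt(lam_hat * (lam - gamma))  # off-diagonal entry of (27)
    return lam - cross ** 2 / lam_hat


lam, lam_hat, gamma = 2.0, 1.5, 0.4  # hypothetical values with 0 < gamma <= lam
cross = math.sqrt(lam_hat * (lam - gamma))
det = lam * lam_hat - cross ** 2  # equals lam_hat * gamma, so (27) is a valid covariance

print(mmse_from_covariance(lam, lam_hat, gamma), det)
```

Note that the MMSE depends only on $\gamma_{\ell}$ and not on $\hat{\lambda}_{\ell}$, which is why the rate in the construction is governed by $\gamma_{\ell}$ alone.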

Now, to derive the one-shot RDP function $R^{o}(D,P)$ , we can make use of a consequence of the SFRL [8, Theorem 1] to show that when common randomness $K$ is available at both the encoder and decoder, there exists a channel simulation scheme that allows $\hat{Z}_{\ell}$ to be reconstructed at the decoder at a communication rate of

\displaystyle I(Z_{\ell};\hat{Z}_{\ell})+\log(I(Z_{\ell};\hat{Z}_{\ell})+1)+5.

(29)

After the reconstruction of $\hat{Z}_{\ell}$ at the decoder, we use the same unitary matrix to transform it into $\hat{X}$ , i.e.,

\displaystyle\hat{X}=\Theta^{T}\hat{Z}.

(30)

The above scheme leads to the one-shot rate, distortion, and perception loss for the $\ell$ -th component of $Z$ as functions of $\lambda_{\ell}$ , $\hat{\lambda}_{\ell}$ and $\gamma_{\ell}$ as follows:

\displaystyle{R}_{\ell}(\gamma_{\ell})=\frac{1}{2}\log\left(\frac{\lambda_{\ell}}{\gamma_{\ell}}\right)+\log\left(\frac{1}{2}\log\left(\frac{\lambda_{\ell}}{\gamma_{\ell}}\right)+1\right)+5,

(31)

\displaystyle{D}_{\ell}(\gamma_{\ell},\hat{\lambda}_{\ell})=\lambda_{\ell}-2\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}+\hat{\lambda}_{\ell},

(32)

\displaystyle{P}_{\ell}(\hat{\lambda}_{\ell})=\frac{1}{2}\left(\frac{\hat{\lambda}_{\ell}}{\lambda_{\ell}}-1+\log\frac{\lambda_{\ell}}{\hat{\lambda}_{\ell}}\right).

(33)

This allows a characterization of an achievable one-shot RDP function of a Gaussian vector source as an optimization problem over $\hat{\lambda}_{\ell}$ and $\gamma_{\ell}$ across its components.
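The per-component perception loss (33) is the KL divergence (15) evaluated between $\mathcal{N}(0,\hat{\lambda}_{\ell})$ and $\mathcal{N}(0,\lambda_{\ell})$. As a sanity check, the following sketch compares the closed form against a trapezoidal approximation of the integral in (15). The grid size, the integration width, and the variances are arbitrary illustrative choices.

```python
import math


def kl_closed_form(lam: float, lam_hat: float) -> float:
    """Per-component perception loss (33), in nats."""
    return 0.5 * (lam_hat / lam - 1.0 + math.log(lam / lam_hat))


def kl_numeric(lam: float, lam_hat: float, n: int = 20001, width: float = 12.0) -> float:
    """Trapezoidal approximation of (15) for N(0, lam_hat) against N(0, lam)."""
    half = width * math.sqrt(lam_hat)  # integrate over +/- `width` standard deviations
    h = 2.0 * half / (n - 1)
    total = 0.0
    for i in range(n):
        x = -half + i * h
        p_hat = math.exp(-x * x / (2.0 * lam_hat)) / math.sqrt(2.0 * math.pi * lam_hat)
        # log of the density ratio P_hat(x) / P(x), written out to avoid underflow
        log_ratio = 0.5 * math.log(lam / lam_hat) - x * x / (2.0 * lam_hat) + x * x / (2.0 * lam)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * p_hat * log_ratio
    return total * h


lam, lam_hat = 1.0, 0.6  # hypothetical variances
print(kl_closed_form(lam, lam_hat), kl_numeric(lam, lam_hat))
```

The closed form vanishes exactly when $\hat{\lambda}_{\ell}=\lambda_{\ell}$, consistent with the requirement that $\phi(P_{X},P_{\hat{X}})=0$ if and only if $P_{X}=P_{\hat{X}}$.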

For the asymptotic setting, the achievable scheme is identical, except that we compress a block of $n$ samples together. As $n\rightarrow\infty$ , the logarithm and the constant terms in (31) can be neglected. This leads to an upper bound on $R^{\infty}(D,P)$ , which by (14) is equal to $R(D,P)$ . This upper bound turns out to be tight, i.e., a converse can be proved. This gives the following characterization of $R(D,P)$ .

Theorem 3

The rate-distortion-perception function $R(D,P)$ for a Gaussian vector source with parameters defined by (1) and (2), and with KL-divergence as the perception metric, is given by the solution to the following optimization problem:


\displaystyle R(D,P)=\min_{\{\hat{\lambda}_{\ell},\gamma_{\ell}\}_{\ell=1}^{L}}\ \frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma_{\ell}}

(34a)

\displaystyle\text{s.t.}\quad 0<\gamma_{\ell}\leq\lambda_{\ell},

(34b)

\displaystyle 0\leq\hat{\lambda}_{\ell},

(34c)

\displaystyle\sum_{\ell=1}^{L}{D}_{\ell}(\gamma_{\ell},\hat{\lambda}_{\ell})\leq D,

(34d)

\displaystyle\sum_{\ell=1}^{L}{P}_{\ell}(\hat{\lambda}_{\ell})\leq P.

(34e)

Proof:

See Appendix C. ∎

An interpretation of the above is as follows. For a given $(D,P)$ , let $\gamma_{\ell}^{*}(D,P)$ and $\hat{\lambda}^{*}_{\ell}(D,P)$ , $\ell\in\{1,\ldots,L\}$ , be the optimal solution to (34). Comparing this with (21), it can be seen that $\gamma^{*}_{\ell}(D,P)$ can be interpreted as the water-level for the $\ell$ -th component, which determines the rate allocated to that component according to (34a); see Fig. 1(b).

IV-C Generalized Water-filling with KL Divergence as Perception Metric

We now proceed to analyze the solution to the optimization program in Theorem 3. It can be shown that the optimization problem (34) is convex. Let $\nu_{1}$ , $\nu_{2}$ , $\{\xi_{\ell}\}_{\ell=1}^{L}$ , $\{\eta_{\ell}\}_{\ell=1}^{L}$ be nonnegative Lagrange multipliers. For $\ell\in\{1,\ldots,L\}$ , we have the first-order conditions:

\frac{1}{2\gamma^{*}_{\ell}(D,P)}-\nu_{1}\sqrt{\frac{\hat{\lambda}^{*}_{\ell}(D,P)}{\lambda_{\ell}-\gamma^{*}_{\ell}(D,P)}}-\xi_{\ell}=0,

(35)

and

\nu_{1}\left(-\sqrt{\frac{\lambda_{\ell}-\gamma^{*}_{\ell}(D,P)}{\hat{\lambda}^{*}_{\ell}(D,P)}}+1\right)+\frac{\nu_{2}}{2}\left(\frac{1}{\lambda_{\ell}}-\frac{1}{\hat{\lambda}^{*}_{\ell}(D,P)}\right)-\eta_{\ell}=0.

(36)

We first focus on the most interesting regime where the distortion and the perception constraints are both active so $\nu_{1},\nu_{2}>0$ , and $\gamma_{\ell}<\lambda_{\ell}$ , $\hat{\lambda}_{\ell}>0$ so that $\xi_{\ell}=\eta_{\ell}=0$ for all $\ell\in\{1,\ldots,L\}$ . In this case, (35) implies that $\hat{\lambda}^{*}_{\ell}(D,P)$ can be expressed as

\displaystyle\hat{\lambda}_{\ell}^{*}(D,P)=\frac{\lambda_{\ell}-\gamma^{*}_{\ell}(D,P)}{4\gamma^{*2}_{\ell}(D,P)\nu_{1}^{2}}.

(37)

Together with (36), this means that $\gamma_{\ell}^{*}(D,P)$ is the positive solution to the following equation

\displaystyle\nu_{1}(1-2\nu_{1}\gamma^{*}_{\ell}(D,P))=\frac{1}{2}\nu_{2}\left(\frac{4\gamma^{*2}_{\ell}(D,P)\nu_{1}^{2}}{\lambda_{\ell}-\gamma^{*}_{\ell}(D,P)}-\frac{1}{\lambda_{\ell}}\right),

(38)

which is quadratic in $\gamma^{*}_{\ell}(D,P)$ and can be solved analytically as follows:

\displaystyle\gamma^{*}_{\ell}(D,P)=\frac{-2\lambda_{\ell}\nu_{1}(1+2\lambda_{\ell}\nu_{1})-\nu_{2}+\sqrt{(\nu_{2}+2\lambda_{\ell}\nu_{1}+4\lambda_{\ell}^{2}\nu_{1}^{2})^{2}+16\lambda_{\ell}^{2}\nu_{1}^{2}(\nu_{2}+2\lambda_{\ell}\nu_{1})(\nu_{2}-1)}}{8\lambda_{\ell}\nu_{1}^{2}(-1+\nu_{2})}.

(39)

There is an alternative expression for $\gamma^{*}_{\ell}(D,P)$ in terms of $\hat{\lambda}_{\ell}^{*}(D,P)$ that can be obtained by solving (37) as a quadratic equation in $\gamma^{*}_{\ell}(D,P)$ as below:

\gamma^{*}_{\ell}(D,P)=\frac{2\lambda_{\ell}}{1+\sqrt{1+16\lambda_{\ell}\hat{\lambda}_{\ell}^{*}(D,P)\nu_{1}^{2}}}.

(40)

This expression is useful later in Corollary 1.
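The algebra above can be checked numerically. Clearing denominators in (38) gives the quadratic $a\gamma^{2}+b\gamma+c=0$ with $a=4\lambda_{\ell}\nu_{1}^{2}(1-\nu_{2})$, $b=-(\nu_{2}+2\lambda_{\ell}\nu_{1}+4\lambda_{\ell}^{2}\nu_{1}^{2})$ and $c=\lambda_{\ell}(2\lambda_{\ell}\nu_{1}+\nu_{2})$; this rearrangement is our own and is stated here as an assumption of the sketch. The code picks the root in $(0,\lambda_{\ell})$, recovers $\hat{\lambda}_{\ell}^{*}$ from (37), and confirms that (38) and (40) hold. The multiplier values are hypothetical.

```python
import math


def gamma_from_quadratic(lam: float, nu1: float, nu2: float) -> float:
    """Root of (38) in (0, lam), using the rearranged quadratic a*g^2 + b*g + c = 0."""
    a = 4.0 * lam * nu1 ** 2 * (1.0 - nu2)
    b = -(nu2 + 2.0 * lam * nu1 + 4.0 * lam ** 2 * nu1 ** 2)
    c = lam * (2.0 * lam * nu1 + nu2)
    if abs(a) < 1e-15:  # nu2 == 1: the equation degenerates to a linear one
        return -c / b
    disc = math.sqrt(b * b - 4.0 * a * c)
    roots = ((-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a))
    return next(r for r in roots if 0.0 < r < lam)


def lam_hat_from_gamma(lam: float, gamma: float, nu1: float) -> float:
    """Reconstruction variance from (37)."""
    return (lam - gamma) / (4.0 * gamma ** 2 * nu1 ** 2)


def gamma_from_lam_hat(lam: float, lam_hat: float, nu1: float) -> float:
    """Alternative expression (40)."""
    return 2.0 * lam / (1.0 + math.sqrt(1.0 + 16.0 * lam * lam_hat * nu1 ** 2))


def residual_38(lam: float, gamma: float, nu1: float, nu2: float) -> float:
    """Left-hand side minus right-hand side of (38)."""
    lhs = nu1 * (1.0 - 2.0 * nu1 * gamma)
    rhs = 0.5 * nu2 * (4.0 * gamma ** 2 * nu1 ** 2 / (lam - gamma) - 1.0 / lam)
    return lhs - rhs


for lam, nu1, nu2 in [(1.0, 1.0, 0.5), (2.0, 0.7, 1.0), (0.25, 1.0, 2.0)]:
    g = gamma_from_quadratic(lam, nu1, nu2)
    lh = lam_hat_from_gamma(lam, g, nu1)
    print(lam, g, lh, residual_38(lam, g, nu1, nu2), gamma_from_lam_hat(lam, lh, nu1))
```

The rearranged quadratic is positive at $\gamma=0$ and negative at $\gamma=\lambda_{\ell}$ whenever $\nu_{1},\nu_{2}>0$, so exactly one root lies in $(0,\lambda_{\ell})$; the sketch relies on this when selecting the root.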

The expressions (39) and (37) give us the following generalized reverse water-filling interpretation of the optimal RDP solution. At given distortion constraint $D$ and perception constraint $P$ , each component of the source with variance $\lambda_{\ell}$ is reconstructed by $\hat{Z}_{\ell}$ having a variance $\hat{\lambda}^{*}_{\ell}(D,P)$ . Because $\gamma^{*}_{\ell}(D,P)$ is the MMSE of estimating $Z_{\ell}$ from $\hat{Z}_{\ell}$ , cf. (28), this requires a rate of $\frac{1}{2}\log\left(\frac{\lambda_{\ell}}{\gamma^{*}_{\ell}(D,P)}\right)$ . The parameters $\hat{\lambda}^{*}_{\ell}(D,P)$ and $\gamma^{*}_{\ell}(D,P)$ are chosen to satisfy the distortion and perception constraints. As already mentioned, $\gamma^{*}_{\ell}(D,P)$ can be thought of as the water-level, cf. (21).

When both the distortion and the perception constraints are active, i.e., $\nu_{1},\nu_{2}>0$ , it is possible to prove (as shown in the theorem below) that

\gamma^{*}_{\ell}(D,P)<\lambda_{\ell},\quad\forall\ell\in\{1,\cdots,L\},

(41)

so every component of the source is always allocated a non-zero rate regardless of the distortion constraint—unlike the traditional reverse water-filling solution, where a component may be allocated zero rate if its variance is below the water-level. Moreover, in contrast to the traditional reverse water-filling, the distortion of each component (i.e., $D_{\ell}(\gamma_{\ell}^{*}(D,P),\hat{\lambda}_{\ell}^{*}(D,P))$ ) may not be the same across the different components. So, an unequal-distortion allocation may be optimal when both perception and distortion constraints are active.

It is also possible that either the distortion or the perception constraint is not active. If the distortion constraint is active while the perception constraint is inactive, i.e., $\nu_{1}>0$ and $\nu_{2}=0$ , and $\eta_{\ell}=\eta^{\prime}_{\ell}=0$ for all $\ell\in\{1,\ldots,L\}$ , then (35) and (36) yield the traditional reverse water-filling solution. Specifically, the water-level is given by $\min\{\frac{1}{2\nu_{1}},\lambda_{\ell}\}$ where $\nu_{1}$ satisfies the following:

\displaystyle\sum_{\ell=1}^{L}\left[\lambda_{\ell}-\frac{1}{2\nu_{1}}\right]^{+}=\left[\sum_{\ell=1}^{L}\lambda_{\ell}-D\right]^{+}.

(42)

By redefining $\frac{1}{2\nu_{1}}$ as $\nu(D)$ , we see that the above expression is the same as (17).

If the distortion constraint is inactive, i.e., $\nu_{1}=0$ , based on (35), we have $\xi_{\ell}>0$ which yields

\gamma^{*}_{\ell}(D,P)=\lambda_{\ell},\qquad\forall\ell\in\{1,\ldots,L\}.

(43)

This implies that every component of the source is assigned a zero rate if the distortion constraint is not active. The decoder simply generates the reconstruction independent of the source using a distribution that satisfies the perception constraint. Such a distribution may not be unique, as shown in the theorem below.

An interesting observation, based on (41) and (43), is that when the perception constraint is active, either all the components are allocated positive rate or all the components are allocated zero rate. This means that the situation in the traditional reverse water-filling, where some of the water-levels are below the eigenvalues while others are equal to the eigenvalues, cannot occur when the perception constraint is active.

The above discussion is summarized in the following.

Theorem 4

Let $(D,P)$ be given distortion and perception constraints that are strictly feasible. The optimal solution of (34) with KL divergence as the perception metric is given as follows:

If both the distortion and perception constraints are active,² then there exist $\nu_{1},\nu_{2}>0$ such that $\gamma^{*}_{\ell}(D,P)$ is as expressed in (39) and $\hat{\lambda}^{*}_{\ell}(D,P)$ is as expressed in (37). Here, $\nu_{1}$ and $\nu_{2}$ are chosen such that

\displaystyle\sum_{\ell=1}^{L}D_{\ell}(\gamma^{*}_{\ell}(D,P),\hat{\lambda}_{\ell}^{*}(D,P))=D,

(44)

\displaystyle\sum_{\ell=1}^{L}P_{\ell}(\hat{\lambda}_{\ell}^{*}(D,P))=P.

(45)

In this case, every component has a positive rate.

² A constraint of a minimization problem is said to be inactive if the optimization problem with the same objective function but with the said constraint removed (while keeping all the other constraints) has at least one optimal solution that already satisfies all the original constraints.

If the distortion constraint is active but the perception constraint is inactive, then there exists $\nu_{1}>0$ such that $\gamma^{*}_{\ell}(D,P)=\min\{\frac{1}{2\nu_{1}},\lambda_{\ell}\}$ , $\hat{\lambda}^{*}_{\ell}(D,P)=\lambda_{\ell}-\min\{\frac{1}{2\nu_{1}},\lambda_{\ell}\}$ and

\displaystyle\sum_{\ell=1}^{L}\left[\lambda_{\ell}-\frac{1}{2\nu_{1}}\right]^{+}=\left[\sum_{\ell=1}^{L}\lambda_{\ell}-D\right]^{+}.

(46)

In this case, some components may have zero rate.

If the distortion constraint is inactive, then $\gamma^{*}_{\ell}(D,P)=\lambda_{\ell}$ , and $\hat{\lambda}^{*}_{\ell}(D,P)$ can be any value in the set

\left\{\hat{\lambda}_{\ell}\ \left|\ \sum_{\ell=1}^{L}{P}_{\ell}(\hat{\lambda}_{\ell})\leq P,\ \ \sum_{\ell=1}^{L}(\lambda_{\ell}+\hat{\lambda}_{\ell})\leq D,\ \ \hat{\lambda}_{\ell}\geq 0\right.\right\}.

(47)

In this case, every component has zero rate.

Proof:

See Appendix D. ∎
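To illustrate the first case of Theorem 4, the sketch below fixes a pair of multipliers $(\nu_{1},\nu_{2})$ and evaluates the closed-form water-level together with (37), (32), (33) and the summands of (34a). The resulting sums play the role of $D$ and $P$ in (44) and (45), so the code traces one point of the tradeoff parametrically rather than solving for the multipliers given $(D,P)$. The eigenvalues and multipliers are hypothetical, the closed form requires $\nu_{2}\neq 1$, and we assume the chosen multipliers correspond to the regime in which both constraints are active.

```python
import math


def gamma_star(lam: float, nu1: float, nu2: float) -> float:
    """Closed-form water-level for the active-active case; requires nu2 != 1."""
    b = nu2 + 2.0 * lam * nu1 + 4.0 * lam ** 2 * nu1 ** 2
    disc = b * b + 16.0 * lam ** 2 * nu1 ** 2 * (nu2 + 2.0 * lam * nu1) * (nu2 - 1.0)
    return (-b + math.sqrt(disc)) / (8.0 * lam * nu1 ** 2 * (nu2 - 1.0))


def allocate(eigs, nu1: float, nu2: float):
    """Per-component (lambda, gamma*, lambda_hat*, distortion, KL perception, rate)."""
    rows = []
    for lam in eigs:
        g = gamma_star(lam, nu1, nu2)
        lh = (lam - g) / (4.0 * g * g * nu1 * nu1)      # (37)
        d = lam - 2.0 * math.sqrt(lh * (lam - g)) + lh   # (32)
        p = 0.5 * (lh / lam - 1.0 + math.log(lam / lh))  # (33)
        r = 0.5 * math.log(lam / g)                      # summand of (34a)
        rows.append((lam, g, lh, d, p, r))
    return rows


eigs = [2.0, 1.0, 0.25]  # hypothetical eigenvalues
nu1, nu2 = 1.0, 0.5      # hypothetical multipliers
rows = allocate(eigs, nu1, nu2)
D = sum(row[3] for row in rows)
P = sum(row[4] for row in rows)
R = sum(row[5] for row in rows)

for lam, g, lh, d, p, r in rows:
    print(f"lambda={lam:5.2f}  gamma*={g:.4f}  lambda_hat*={lh:.4f}  rate={r:.4f}")
print(f"D={D:.4f}  P={P:.4f}  R={R:.4f}  classical level 1/(2 nu1)={1 / (2 * nu1):.4f}")
```

In this example the three water-levels differ and each lies strictly below its eigenvalue, so every component receives a positive rate. With the same $\nu_{1}$ and no perception constraint, the common level $\frac{1}{2\nu_{1}}=0.5$ would exceed the smallest eigenvalue $0.25$, and that component would receive zero rate.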

IV-D RDP Function and Generalized Reverse Water-filling with Wasserstein-2 Distance as Perception Metric

Next, consider the Wasserstein-2 distance as the perception metric, i.e., $\phi(P_{X},P_{\hat{X}})=W_{2}^{2}(P_{X},P_{\hat{X}})$ . To that end, we have the following definitions for distortion and perception loss functions. Let the distortion loss function of the $\ell$ -th component be as in (32). Replace the perception loss function in (33) by the following:

\displaystyle{P}_{\ell}(\hat{\lambda}_{\ell})=\left(\sqrt{\lambda_{\ell}}-\sqrt{\hat{\lambda}_{\ell}}\right)^{2}.

(48)

The following theorem characterizes the RDP function with Wasserstein-2 perception loss in terms of an optimization problem.

Theorem 5

The rate-distortion-perception function $R(D,P)$ with Wasserstein-2 distance as the perception metric is given by the optimization program in (34) with the perception loss function (33) replaced by (48).

Proof:

The proof is similar to that of Theorem 3 with some differences which are highlighted in Appendix E. ∎
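To see where the per-component loss (48) comes from, recall the standard closed form of the squared Wasserstein-2 distance between two zero-mean Gaussian distributions. The sketch below assumes, purely for illustration and not as a replacement for Appendix E, that the reconstruction covariance is diagonalized by the same unitary matrix as $\Sigma_{X}$ with eigenvalues $\hat{\lambda}_{\ell}$; the symbol $\Sigma_{\hat{X}}$ is notation introduced here.

```latex
% Assumption: \Sigma_X and \Sigma_{\hat{X}} share the same eigenvectors,
% with eigenvalues \lambda_\ell and \hat{\lambda}_\ell respectively.
\begin{align*}
W_2^2\big(\mathcal{N}(0,\Sigma_X),\mathcal{N}(0,\Sigma_{\hat{X}})\big)
  &= \mathrm{tr}\Big(\Sigma_X+\Sigma_{\hat{X}}
     -2\big(\Sigma_X^{1/2}\,\Sigma_{\hat{X}}\,\Sigma_X^{1/2}\big)^{1/2}\Big) \\
  &= \sum_{\ell=1}^{L}\Big(\lambda_\ell+\hat{\lambda}_\ell
     -2\sqrt{\lambda_\ell\hat{\lambda}_\ell}\Big)
   = \sum_{\ell=1}^{L}\Big(\sqrt{\lambda_\ell}-\sqrt{\hat{\lambda}_\ell}\Big)^{2}.
\end{align*}
```

Each summand in the last expression is exactly the loss $P_{\ell}(\hat{\lambda}_{\ell})$ in (48).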

Similar to the KL-divergence case, the optimization program for the Wasserstein-2 distance is convex. For $\ell\in\{1,\ldots,L\}$ , we have the following first-order conditions:

\displaystyle\frac{1}{2\gamma_{\ell}^{*}(D,P)}-\nu_{1}\sqrt{\frac{\hat{\lambda% }^{*}_{\ell}(D,P)}{\lambda_{\ell}-\gamma^{*}_{\ell}(D,P)}}-\xi_{\ell}=0,

(49)

and

\displaystyle\nu_{1}\left(-\sqrt{\frac{\lambda_{{\ell}}-\gamma^{*}_{{\ell}}(D,% P)}{\hat{\lambda}^{*}_{{\ell}}(D,P)}}+1\right)+\nu_{2}\left(1-\sqrt{\frac{% \lambda_{\ell}}{\hat{\lambda}^{*}_{\ell}(D,P)}}\right)+\eta_{\ell}=0.

(50)

Consider the case where both distortion and perception constraints are active, i.e., $\nu_{1},\nu_{2}>0$ and $\xi_{\ell}=\eta_{\ell}=0$ for all $\ell\in\{1,\ldots,L\}$ . In this case, (49) and (50) yield the following solutions

	$\displaystyle\gamma_{\ell}^{*}(D,P)$	$\displaystyle=$	$\displaystyle\frac{\theta_{\ell}}{2\nu_{1}},$		(51)
	$\displaystyle\hat{\lambda}_{\ell}^{*}(D,P)$	$\displaystyle=$	$\displaystyle\frac{\lambda_{\ell}}{\left(1+\frac{(1-\theta_{\ell})\nu_{1}}{\nu% _{2}}\right)^{2}},$		(52)

where $\theta_{\ell}$ is defined to be the unique solution of the following equation:

\displaystyle\frac{\theta_{\ell}}{1+\frac{(1-\theta_{\ell})\nu_{1}}{\nu_{2}}}=% \sqrt{1-\frac{\theta_{\ell}}{2\nu_{1}\lambda_{\ell}}}.

(53)

As in the case of KL divergence, it is possible to prove that when both the distortion and the perception constraints are active we have $\gamma^{*}_{\ell}(D,P)<\lambda_{\ell}$ . Thus, every component is compressed at a positive rate.
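For given multipliers, (51)-(53) can be evaluated numerically. The following Python sketch solves (53) by bisection and then applies (51) and (52). It assumes that the relevant root lies in the interval $(0,\min\{1,2\nu_{1}\lambda_{\ell}\})$, on which the left-hand side of (53) is increasing and the right-hand side is decreasing; this bracket, the function names, and the numerical values are assumptions made here for illustration. Finding the pair $(\nu_{1},\nu_{2})$ that meets the distortion and perception constraints with equality is not attempted.

```python
import math

def solve_theta(lam, nu1, nu2, iters=200):
    """Bisection for theta solving (53) on (0, min(1, 2*nu1*lam)) (assumed bracket)."""
    a = nu1 / nu2
    def f(theta):
        lhs = theta / (1.0 + (1.0 - theta) * a)
        rhs = math.sqrt(max(0.0, 1.0 - theta / (2.0 * nu1 * lam)))
        return lhs - rhs
    lo, hi = 0.0, min(1.0, 2.0 * nu1 * lam)   # f(lo) < 0 < f(hi)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def w2_allocation(lam, nu1, nu2):
    """Return (theta, gamma, lam_hat) for one component."""
    theta = solve_theta(lam, nu1, nu2)
    gamma = theta / (2.0 * nu1)                                  # (51)
    lam_hat = lam / (1.0 + (1.0 - theta) * nu1 / nu2) ** 2       # (52)
    return theta, gamma, lam_hat

lam, nu1, nu2 = 2.0, 0.8, 0.5        # hypothetical eigenvalue and multipliers
theta, gamma, lam_hat = w2_allocation(lam, nu1, nu2)

# Residuals of the first-order conditions (49) and (50) with xi = eta = 0
res49 = 1.0 / (2.0 * gamma) - nu1 * math.sqrt(lam_hat / (lam - gamma))
res50 = (nu1 * (1.0 - math.sqrt((lam - gamma) / lam_hat))
         + nu2 * (1.0 - math.sqrt(lam / lam_hat)))
```

Both residuals vanish up to floating-point error, and the resulting $\gamma_{\ell}$ is strictly below $\lambda_{\ell}$, so the component has a positive rate.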

When the distortion constraint is active but the perception constraint is not active, the problem reduces to traditional reverse water-filling. Finally, when the distortion constraint is not active, i.e., $\nu_{1}=0$ , a zero rate is assigned to all components. This discussion is summarized in the following.

Theorem 6

Let $(D,P)$ be a given pair of distortion and perception constraints that is strictly feasible. The optimal solution of (34) with the perception loss function (33) replaced by (48) is given as follows:

If both the distortion and perception constraints are active, then there exist $\nu_{1},\nu_{2}>0$ such that $\gamma^{*}_{\ell}(D,P)$ is as expressed in (51) and $\hat{\lambda}^{*}_{\ell}(D,P)$ is as expressed in (52). Here, $\nu_{1}$ and $\nu_{2}$ are chosen such that

	$\displaystyle\sum_{\ell=1}^{L}D_{\ell}(\gamma^{*}_{\ell}(D,P),\hat{\lambda}_{\ell}^{*}(D,P))$	$\displaystyle=$	$\displaystyle D,$		(54)
	$\displaystyle\sum_{\ell=1}^{L}P_{\ell}(\hat{\lambda}_{\ell}^{*}(D,P))$	$\displaystyle=$	$\displaystyle P.$		(55)

In this case, every component has a positive rate.

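If the distortion constraint is active but the perception constraint is inactive, then there exists $\nu_{1}>0$ such that $\gamma^{*}_{\ell}(D,P)=\min\{\frac{1}{2\nu_{1}},\lambda_{\ell}\}$, $\hat{\lambda}^{*}_{\ell}(D,P)=\lambda_{\ell}-\min\{\frac{1}{2\nu_{1}},\lambda_{\ell}\}$ and

\displaystyle\sum_{\ell=1}^{L}\left[\lambda_{\ell}-\frac{1}{2\nu_{1}}\right]^{+}=\left[\sum_{\ell=1}^{L}\lambda_{\ell}-D\right]^{+}.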

(56)

In this case, some components may have zero rate.

If the distortion constraint is inactive, then $\gamma^{*}_{\ell}(D,P)=\lambda_{\ell}$ , and $\hat{\lambda}^{*}_{\ell}(D,P)$ can be any value in the set

\left\{\hat{\lambda}_{\ell}\ \left|\ \sum_{\ell=1}^{L}{P}_{\ell}(\hat{\lambda}_{\ell})\leq P,\ \ \sum_{\ell=1}^{L}\left(\lambda_{\ell}+\hat{\lambda}_{\ell}\right)\leq D,\ \ \hat{\lambda}_{\ell}\geq 0\right.\right\}.

(57)

In this case, every component has zero rate.

Proof:

See Appendix F. ∎

IV-E Perceptually Perfect Reconstruction

In this section, we focus on the special case of perfect perceptual quality, and study the properties of the RDP function with $P=0$ .

Corollary 1

The RDP function of a Gaussian vector source with $P=0$ is

\displaystyle R(D,0)=\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{1+\sqrt{1+16\nu_{1}% ^{2}\lambda_{\ell}^{2}}}{2},

(58)

for some positive $\nu_{1}$ that satisfies

\displaystyle D=\sum_{\ell=1}^{L}\left[2\lambda_{\ell}-2\sqrt{\lambda_{\ell}% \left(\lambda_{\ell}-\gamma^{*}_{\ell}(D,0)\right)}\right],

(59)

where

\displaystyle\gamma^{*}_{\ell}(D,0)=\frac{2\lambda_{\ell}}{1+\sqrt{1+16\nu_{1}% ^{2}\lambda_{\ell}^{2}}},\qquad\ell\in\{1,\ldots,L\}.

(60)

Proof:

See Appendix G. ∎

An interpretation of the optimal rate allocation in this $P=0$ case is as follows. By (58), the optimal rate allocated to the $\ell$ -th component is controlled by the expression $\frac{1+\sqrt{1+16\nu_{1}^{2}\lambda_{\ell}^{2}}}{2}$ . So, if a component has a larger variance, it is compressed at a higher rate. Further, by (60) it also has a higher water-level.
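Corollary 1 defines $R(D,0)$ implicitly through $\nu_{1}$. The following Python sketch evaluates it numerically, assuming natural logarithms and $0<D<2\sum_{\ell}\lambda_{\ell}$. It uses bisection on $\nu_{1}$ in log scale, relying on the fact that the right-hand side of (59) decreases as $\nu_{1}$ grows. The function names, the bracket for $\nu_{1}$, and the example values are assumptions made here.

```python
import math

def gamma_perfect(lam, nu1):
    """Water-level (60) of one component at P = 0."""
    return 2.0 * lam / (1.0 + math.sqrt(1.0 + 16.0 * nu1 ** 2 * lam ** 2))

def distortion_perfect(lams, nu1):
    """Total distortion (59) as a function of nu1."""
    return sum(2.0 * lam - 2.0 * math.sqrt(lam * (lam - gamma_perfect(lam, nu1)))
               for lam in lams)

def rdp_perfect(lams, D, iters=200):
    """Return (R(D,0) in nats, nu1, water-levels) for 0 < D < 2*sum(lams)."""
    lo, hi = 1e-12, 1e12                  # assumed bracket for nu1
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # bisection in log scale
        if distortion_perfect(lams, mid) > D:
            lo = mid                      # distortion too large: increase nu1
        else:
            hi = mid
    nu1 = math.sqrt(lo * hi)
    gammas = [gamma_perfect(lam, nu1) for lam in lams]
    rate = 0.5 * sum(
        math.log((1.0 + math.sqrt(1.0 + 16.0 * nu1 ** 2 * lam ** 2)) / 2.0)
        for lam in lams)                  # (58)
    return rate, nu1, gammas

lams = [0.5, 1.0, 4.0]                    # hypothetical eigenvalues
D = 3.0                                   # hypothetical distortion budget
rate, nu1, gammas = rdp_perfect(lams, D)
rate_from_levels = 0.5 * sum(math.log(l / g) for l, g in zip(lams, gammas))
```

The last line recomputes the rate as $\frac{1}{2}\sum_{\ell}\log(\lambda_{\ell}/\gamma^{*}_{\ell})$, which agrees with (58) because (60) gives $\lambda_{\ell}/\gamma^{*}_{\ell}=\frac{1+\sqrt{1+16\nu_{1}^{2}\lambda_{\ell}^{2}}}{2}$. The water-levels come out ordered like the variances, as noted above.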

Under general perception and distortion constraints, the encoding and decoding strategy adopted in this paper (which involves constructing $\hat{Z}_{\ell}$ as in (27)) can be thought of as first compressing each component of the source at an individual rate specified by the distortion level $\gamma_{\ell}^{*}(D,P)$ based on the conventional rate-distortion tradeoff, then scaling the compressed source to a variance of $\hat{\lambda}_{\ell}^{*}(D,P)$ to satisfy the perception constraint. For the perfect perception case with $P=0$ , the compression rate becomes (58) and the distortion level becomes (60); further, each component of the compressed signal is simply scaled to match the variance of the source in order to ensure zero perception loss. The distortion after scaling is given by (59). This is shown in Fig. 2.

We further note that at a fixed $R$ , the rate allocated to each component is in general different for different $(D,P)$ tradeoff points. Whereas for the scalar Gaussian source, a universal representation for different $(D,P)$ points at a fixed $R$ is possible via scaling [9], for the Gaussian vector source such universal representation does not exist, due to the different rate allocations in each component at different $(D,P)$ tradeoff points.

Next, we investigate the asymptotic behavior of the compression rate and the distortion level in the perfect perception case.

Proposition 3 (High-Distortion Compression)

In the high-distortion and perfect perception regime, we have that for sufficiently small $\epsilon>0$ ,

\displaystyle R\left(2\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,0\right)=\frac{% \epsilon^{2}}{8\sum_{\ell=1}^{L}\lambda_{\ell}^{2}}+O(\epsilon^{3}),

(61)

where the water-levels are given by

\displaystyle\gamma^{*}_{\ell}\left(2\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,% 0\right)=\lambda_{\ell}-\frac{\epsilon^{2}\lambda_{\ell}^{3}}{4\left(\sum_{% \ell=1}^{L}\lambda_{\ell}^{2}\right)^{2}}+O(\epsilon^{3}),\quad\ell\in\{1,% \ldots,L\}.

(62)

Proof:

See Appendix H-1.∎

Here, we express $R(D,0)$ in terms of the deviation from the maximum distortion at perfect perception, which is attained at zero rate. This maximum distortion can be shown to be $2\sum_{\ell=1}^{L}\lambda_{\ell}$, which is twice the total variance of the source [9], because at zero rate the decoder should simply generate an independent Gaussian random vector with the same covariance matrix as the source. Comparing $R\left(2\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,0\right)$ of Proposition 3 with $R\left(\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,\infty\right)$ in Proposition 1, it is interesting to see that the variances of the source enter $R\left(2\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,0\right)$ as $\sum_{\ell=1}^{L}\lambda^{2}_{\ell}$, which is the sum of the squares of the variances over all the components. This is in contrast to the corresponding factor in $R\left(\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,\infty\right)$ in the traditional reverse water-filling solution, which is simply $\lambda^{\max}$. This is a consequence of the perfect perception constraint, which requires all the components to be reconstructed at the decoder with the same variances as the source.
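The leading terms of (61) and (62) can be recovered from Corollary 1 by expanding (58)-(60) for small $\nu_{1}$. The following is only a sketch of that computation, with $D=2\sum_{\ell}\lambda_{\ell}-\epsilon$; the rigorous argument is the one in Appendix H-1.

```latex
% Sketch: small-\nu_1 expansion of (58)-(60), with D = 2\sum_\ell \lambda_\ell - \epsilon.
\begin{align*}
\sqrt{1+16\nu_1^2\lambda_\ell^2} &= 1+8\nu_1^2\lambda_\ell^2+O(\nu_1^4), \\
\gamma^*_\ell(D,0) &= \frac{2\lambda_\ell}{2+8\nu_1^2\lambda_\ell^2+O(\nu_1^4)}
   = \lambda_\ell-4\nu_1^2\lambda_\ell^3+O(\nu_1^4), \\
\sqrt{\lambda_\ell\big(\lambda_\ell-\gamma^*_\ell(D,0)\big)} &= 2\nu_1\lambda_\ell^2+O(\nu_1^3), \\
D &= 2\sum_{\ell=1}^{L}\lambda_\ell-4\nu_1\sum_{\ell=1}^{L}\lambda_\ell^2+O(\nu_1^3)
   \;\Longrightarrow\;
   \nu_1=\frac{\epsilon}{4\sum_{\ell=1}^{L}\lambda_\ell^2}+O(\epsilon^3), \\
R(D,0) &= \frac12\sum_{\ell=1}^{L}\log\big(1+4\nu_1^2\lambda_\ell^2+O(\nu_1^4)\big)
   = 2\nu_1^2\sum_{\ell=1}^{L}\lambda_\ell^2+O(\nu_1^4)
   = \frac{\epsilon^2}{8\sum_{\ell=1}^{L}\lambda_\ell^2}+O(\epsilon^3).
\end{align*}
```

Substituting the expression for $\nu_{1}$ into the second line gives $\gamma^{*}_{\ell}=\lambda_{\ell}-\frac{\epsilon^{2}\lambda_{\ell}^{3}}{4(\sum_{\ell}\lambda_{\ell}^{2})^{2}}+O(\epsilon^{3})$, which is (62).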

Proposition 4 (Low-Distortion Compression)

In the low-distortion and perfect perception regime, we have that for sufficiently small $\epsilon>0$ ,

\displaystyle R(\epsilon,0)=\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{L\lambda_{% \ell}}{\epsilon}+\frac{\epsilon}{8L}\sum_{\ell=1}^{L}\frac{1}{\lambda_{\ell}}+% O(\epsilon^{2}),

(63)

where the water-levels are given by

\displaystyle\gamma^{*}_{\ell}(\epsilon,0)=\frac{\epsilon}{L}-\frac{\epsilon^{% 2}}{2L^{2}\lambda_{\ell}}+\frac{\epsilon^{2}}{4L^{3}}\sum_{\ell=1}^{L}\frac{1}% {\lambda_{\ell}}+O(\epsilon^{3}),\quad\ell\in\{1,\ldots,L\}.

(64)

Proof:

See Appendix H-2. ∎

Comparing Proposition 4 with Proposition 2, we see that in this high-rate low-distortion regime, the extra rate required to achieve zero perception loss and the corresponding shift in the water-levels scale as

	$\displaystyle R(\epsilon,0)-R(\epsilon,\infty)$	$\displaystyle=$	$\displaystyle\frac{\epsilon}{8L}\sum_{\ell=1}^{L}\frac{1}{\lambda_{\ell}}+O(\epsilon^{2}),$		(65)
	$\displaystyle\gamma^{*}_{\ell}(\epsilon,\infty)-\gamma^{*}_{\ell}(\epsilon,0)$	$\displaystyle=$	$\displaystyle\frac{\epsilon^{2}}{2L^{2}\lambda_{\ell}}-\frac{\epsilon^{2}}{4L^{3}}\sum_{\ell=1}^{L}\frac{1}{\lambda_{\ell}}+O(\epsilon^{3}),\quad\ell\in\{1,\ldots,L\}.$		(66)
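The low-distortion expansions can be checked numerically against the exact expressions of Corollary 1. The Python sketch below is an illustration under stated assumptions: natural logarithms, a hypothetical three-component source, $\epsilon=10^{-3}$, and a bisection solver for $\nu_{1}$ written here for this purpose. The rate $R(\epsilon,\infty)$ is taken from (76) and (77).

```python
import math

def solve_perfect(lams, D):
    """Bisection on nu1 (log scale) so that (59) holds; returns the water-levels (60)."""
    def gamma(lam, nu1):
        return 2.0 * lam / (1.0 + math.sqrt(1.0 + 16.0 * nu1 ** 2 * lam ** 2))
    def dist(nu1):
        return sum(2.0 * l - 2.0 * math.sqrt(l * (l - gamma(l, nu1))) for l in lams)
    lo, hi = 1e-12, 1e12                       # assumed bracket for nu1
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        lo, hi = (mid, hi) if dist(mid) > D else (lo, mid)
    nu1 = math.sqrt(lo * hi)
    return [gamma(l, nu1) for l in lams]

lams, eps = [0.5, 1.0, 4.0], 1e-3              # hypothetical source and distortion
L = len(lams)
S = sum(1.0 / l for l in lams)                 # sum of inverse variances

gammas = solve_perfect(lams, eps)
# R(eps, 0): (58) rewritten via (60) as 0.5 * sum log(lam / gamma)
rate_p0 = 0.5 * sum(math.log(l / g) for l, g in zip(lams, gammas))
# R(eps, inf): water-level eps / L for every component, from (76)-(77)
rate_rd = 0.5 * sum(math.log(L * l / eps) for l in lams)

gap_exact = rate_p0 - rate_rd
gap_asym = eps * S / (8.0 * L)                 # leading term of (65)
levels_asym = [eps / L - eps ** 2 / (2 * L ** 2 * l) + eps ** 2 * S / (4 * L ** 3)
               for l in lams]                  # first three terms of (64)
```

With these values the exact rate gap and the leading term of (65) agree closely, and the exact water-levels differ from the truncated expansion (64) only at third order in $\epsilon$.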

Fig. 3 shows the water-levels of different components for both low-distortion and high-distortion compression with $P=\infty$ or $P=0$ for an example of a Gaussian vector source. The water-levels determine the compression rates assigned to each component.

In Fig. 3(a), for high-distortion compression with no perception constraint, all components except the one with the largest eigenvalue are allocated a zero compression rate (cf. Proposition 1). With an active perception constraint, as shown in Fig. 3(c) for the $P=0$ case, all components are allocated positive rates (cf. Proposition 3).

In Fig. 3(b), for low-distortion compression with no perception constraint, the water-levels of all components are the same (cf. Proposition 2). At low distortion and with an active perception constraint, as shown in Fig. 3(d) for the $P=0$ case, the water-levels of different components are approximately equal with some slight differences which are determined by (64) in Proposition 4. Therefore, in the low-distortion regime, the water-levels of all components are approximately the same regardless of the perception constraint.

V Conclusions

This paper characterizes the RDP function for a Gaussian vector source. In contrast to the traditional reverse water-filling solution (without a perception constraint), the water-levels assigned to different components are not necessarily equal. When both distortion and perception constraints are active, every component is assigned a positive rate. These results have implications for perception-aware image coding.

Appendix A Asymptotic Analysis of the Traditional RD Function

A-1 High-Distortion Compression

Here, we consider $D=\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon$ for sufficiently small $\epsilon>0$ . Without loss of generality, we assume that the eigenvalues are ordered as follows

\displaystyle\lambda_{1}\leq\lambda_{2}\leq\ldots\leq\lambda_{L}.

(67)

First consider the case that $|L^{\max}|=1$ . The distortion constraint (17) implies that

\displaystyle\sum_{\ell=1}^{L}[\lambda_{\ell}-\nu(D)]^{+}=\epsilon.

(68)

The above condition implies that for a small enough $\epsilon>0$ , $\nu(D)$ should satisfy

\displaystyle\lambda_{1}\leq\lambda_{2}\leq\ldots\leq\lambda_{L-1}\leq\nu(D)<% \lambda_{L}.

(69)

Considering (69) with (68) yields

\displaystyle\nu(D)=\lambda_{L}-\epsilon.

(70)

Plugging the above into the RDP function of Proposition 1, we get

	$\displaystyle R\left(\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,\infty\right)$	$\displaystyle=$	$\displaystyle\frac{1}{2}\log\frac{\lambda_{L}}{\lambda_{L}-\epsilon}$		(71)
		$\displaystyle=$	$\displaystyle\frac{1}{2\lambda_{L}}\epsilon+O(\epsilon^{2}).$		(72)

Finally, noting $\lambda_{L}=\max_{\ell}\lambda_{\ell}$ gives (22).

If $|L^{\max}|>1$ , then similar to the above discussion, all eigenvalues except the largest ones are assigned a zero compression rate and for the maximum eigenvalues, we have the following water-level

\displaystyle\nu(D)=\lambda^{\max}-\frac{\epsilon}{|L^{\max}|},

(73)

and the following rate

	$\displaystyle R\left(\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,\infty\right)$	$\displaystyle=$	$\displaystyle\frac{|L^{\max}|}{2}\log\frac{\lambda_{L}}{\lambda_{L}-\frac{\epsilon}{|L^{\max}|}}$		(74)
		$\displaystyle=$	$\displaystyle\frac{1}{2\lambda_{L}}\epsilon+O(\epsilon^{2}).$		(75)

This proves (22) for arbitrary $L^{\max}$ .

A-2 Low-Distortion Compression

Consider the case of $D=\epsilon$ for sufficiently small $\epsilon>0$. In this low-distortion regime, the constant water-level $\nu(D)$ lies below every eigenvalue, so no component is saturated. Thus, Proposition 1 simplifies to the following

\displaystyle R(\epsilon,\infty)=\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda% _{\ell}}{\nu(D)}.

(76)

Also, the distortion constraint (17) implies that

\displaystyle\nu(D)=\frac{D}{L}.

(77)

Combining (76) and (77), we get the rate expression (24) in Proposition 2.

Appendix B Proof of Theorem 2

First, we prove the optimality of Gaussian reconstruction for the case of the KL-divergence as the perception metric. Define the following distribution

\displaystyle P_{\hat{X}^{*}|X}=\arg\min_{\begin{subarray}{c}P_{\hat{X}|X}:\\ \mathbb{E}[\|X-\hat{X}\|^{2}]\leq D\\ D(P_{\hat{X}}\|P_{X})\leq P\end{subarray}}I(X;\hat{X}).

(78)

Now, let $\hat{X}_{G}$ be a random variable jointly Gaussian distributed with $X$ such that


$\displaystyle\mathbbm{E}[\hat{X}_{G}]$	$\displaystyle=$	$\displaystyle\mathbbm{E}[\hat{X}^{*}],$	(79a)
$\displaystyle\text{cov}(\hat{X}_{G},X)$	$\displaystyle=$	$\displaystyle\text{cov}(\hat{X}^{*},X).$	(79b)

We proceed with lower bounding the rate as follows

$\displaystyle I(X;\hat{X}^{*})$	$\displaystyle=$	$\displaystyle h(X)-h(X|\hat{X}^{*})$	(80)
	$\displaystyle\geq$	$\displaystyle h(X)-h(X|\hat{X}_{G})$	(81)
	$\displaystyle=$	$\displaystyle I(X;\hat{X}_{G}),$	(82)

where (81) follows from (79) and the fact that under a fixed covariance matrix, a jointly Gaussian distribution maximizes the conditional differential entropy [17, Lemma 2]. The condition (79) also implies that for the distortion loss, we have

\displaystyle D\geq\mathbbm{E}[\|X-\hat{X}^{*}\|^{2}]=\mathbbm{E}[\|X-\hat{X}_% {G}\|^{2}].

(83)

Moreover, for the perception loss, we have

$\displaystyle D(P_{\hat{X}^{*}}\|P_{X})$	$\displaystyle=$	$\displaystyle\int P_{\hat{X}^{*}}(x)\log\frac{P_{\hat{X}^{*}}(x)}{P_{X}(x)}dx$	(84)
	$\displaystyle=$	$\displaystyle-h(\hat{X}^{*})-\int P_{\hat{X}^{*}}(x)\log P_{X}(x)dx$	(85)
	$\displaystyle=$	$\displaystyle-h(\hat{X}^{*})+\frac{1}{2}\int P_{\hat{X}^{*}}(x)x\Sigma_{X}^{-1}x^{T}dx+\frac{1}{2}\log\left((2\pi)^{L}\det(\Sigma_{X})\right)$	(86)
	$\displaystyle=$	$\displaystyle-h(\hat{X}^{*})+\frac{1}{2}\int P_{\hat{X}_{G}}(x)x\Sigma_{X}^{-1}x^{T}dx+\frac{1}{2}\log\left((2\pi)^{L}\det(\Sigma_{X})\right)$	(87)
	$\displaystyle=$	$\displaystyle-h(\hat{X}^{*})-\int P_{\hat{X}_{G}}(x)\log P_{X}(x)dx$	(88)
	$\displaystyle\geq$	$\displaystyle-h(\hat{X}_{G})-\int P_{\hat{X}_{G}}(x)\log P_{X}(x)dx$	(89)
	$\displaystyle=$	$\displaystyle D(P_{\hat{X}_{G}}\|P_{X}),$	(90)

where (87) follows because the expression $x\Sigma_{X}^{-1}x^{T}$ for a vector $x=(x_{1},\ldots,x_{L})$ only contains the terms such as $x_{\ell}^{2}$ , $x_{\ell}$ and $x_{\ell}x_{\ell^{\prime}}$ for $\ell,\ell^{\prime}\in\{1,\ldots,L\}$ , and since according to (79), $\hat{X}^{*}$ has the same mean and covariance matrix as $\hat{X}_{G}$ , the expected values of these terms with respect to $P_{\hat{X}^{*}}$ are equal to the same expectations calculated with respect to $P_{\hat{X}_{G}}$ ; (89) follows because for a fixed covariance matrix, the differential entropy is maximized by a Gaussian distribution [16, Thm 8.6.5]. Finally, there is no loss of optimality in setting $\mathbb{E}[\hat{X}_{G}]=0$ since replacing $\hat{X}_{G}$ with $\hat{X}_{G}-\mathbb{E}[\hat{X}_{G}]$ does not increase $I(X;\hat{X}_{G})$ , $\mathbb{E}[\|X-\hat{X}_{G}\|^{2}]$ , and $D(P_{\hat{X}_{G}}\|P_{X})$ .

Thus, replacing $\hat{X}^{*}$ by $\hat{X}_{G}$ does not increase the rate, while the distortion and perception constraints remain satisfied. Hence, there is no loss of optimality in taking the reconstruction to be jointly Gaussian with $X$.

For the case of the Wasserstein-2 distance as the perception metric, lower bounding steps for $I(X;\hat{X}^{*})$ and $\mathbbm{E}[\|X-\hat{X}^{*}\|^{2}]$ are the same as (82) and (83), respectively. For the perception metric, the steps are refined as follows. Define the following distribution

\displaystyle P_{U^{*}V^{*}}=\arg\inf_{\begin{subarray}{c}\tilde{P}_{UV}:\\ \tilde{P}_{U}=P_{X}\\ \tilde{P}_{V}=P_{\hat{X}^{*}}\end{subarray}}\mathbbm{E}_{\tilde{P}}[\|U-V\|^{2% }].

(91)

Now, define $P_{U_{G}V_{G}}$ to be a joint Gaussian distribution such that


$\displaystyle\mathbbm{E}[U_{G}]$	$\displaystyle=$	$\displaystyle\mathbbm{E}[U^{*}],$	(92a)
$\displaystyle\mathbbm{E}[V_{G}]$	$\displaystyle=$	$\displaystyle\mathbbm{E}[V^{*}],$	(92b)
$\displaystyle\text{cov}(U_{G},V_{G})$	$\displaystyle=$	$\displaystyle\text{cov}(U^{*},V^{*}).$	(92c)

Then, we have the following set of inequalities:

$\displaystyle P\geq W_{2}^{2}(P_{X},P_{\hat{X}^{*}})$	$\displaystyle=$	$\displaystyle\inf_{\begin{subarray}{c}\tilde{P}_{UV}:\\ \tilde{P}_{U}=P_{X}\\ \tilde{P}_{V}=P_{\hat{X}^{*}}\end{subarray}}\mathbbm{E}_{\tilde{P}}[\|U-V\|^{2}]$	(93)
	$\displaystyle=$	$\displaystyle\mathbbm{E}[\|U^{*}-V^{*}\|^{2}]$	(94)
	$\displaystyle=$	$\displaystyle\mathbbm{E}[\|U_{G}-V_{G}\|^{2}]$	(95)
	$\displaystyle\geq$	$\displaystyle W_{2}^{2}(P_{U_{G}},P_{V_{G}})$	(96)
	$\displaystyle=$	$\displaystyle\inf_{\begin{subarray}{c}\hat{P}_{UV}:\\ \hat{P}_{U}=P_{U_{G}}\\ \hat{P}_{V}=P_{V_{G}}\end{subarray}}\mathbbm{E}_{\hat{P}}[\|U-V\|^{2}]$	(97)
	$\displaystyle=$	$\displaystyle\inf_{\begin{subarray}{c}\hat{P}_{UV}:\\ \hat{P}_{U}=P_{X}\\ \hat{P}_{V}=P_{\hat{X}_{G}}\end{subarray}}\mathbbm{E}_{\hat{P}}[\|U-V\|^{2}]$	(98)
	$\displaystyle=$	$\displaystyle W_{2}^{2}(P_{X},P_{\hat{X}_{G}}),$	(99)

where

• (94) follows from the definition in (91);
• (95) follows from (92), which states that $(U^{*},V^{*})$ and $(U_{G},V_{G})$ have the same first- and second-order statistics;
• (98) follows because $P_{V_{G}}=P_{\hat{X}_{G}}$ and $P_{U_{G}}=P_{X}$, which are justified as follows. First, notice that both $P_{V_{G}}$ and $P_{\hat{X}_{G}}$ are Gaussian distributions. According to (92), the first- and second-order statistics of $V_{G}$ are equal to those of $V^{*}$. Also, from (91), we know that $P_{V^{*}}=P_{\hat{X}^{*}}$, hence the first- and second-order statistics of $V^{*}$ and $\hat{X}^{*}$ are the same. On the other hand, from (79), we know that the first- and second-order statistics of $\hat{X}^{*}$ are equal to those of $\hat{X}_{G}$. Thus, we conclude that $P_{V_{G}}=P_{\hat{X}_{G}}$. A similar argument shows that $P_{U_{G}}=P_{X}$.

Thus, without loss of optimality one can replace $\hat{X}^{*}$ by $\hat{X}_{G}$, since the rate does not increase while the distortion and perception constraints remain satisfied.

Appendix C Proof of Theorem 3

We aim to establish the RDP function for the case of KL-divergence as the perception metric by showing that

\displaystyle R(D,P)=R^{*}(D,P),

(100)

where


	$\displaystyle R^{*}(D,P)=\min_{\{\hat{\lambda}_{\ell},\gamma_{{\ell}}\}_{\ell=% 1}^{L}}\;\;\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma_{{\ell% }}}$		(101a)
	$\displaystyle\hskip 76.82234pt\text{s.t.}\qquad 0<\gamma_{\ell}\leq\lambda_{% \ell},$		(101b)
	$\displaystyle\hskip 113.81102pt0\leq\hat{\lambda}_{\ell},$		(101c)
	$\displaystyle\hskip 113.81102pt\sum_{\ell=1}^{L}\left(\lambda_{{\ell}}-2\sqrt{% \hat{\lambda}_{{\ell}}(\lambda_{{\ell}}-\gamma_{{\ell}})}+\hat{\lambda}_{{\ell% }}\right)\leq D,$		(101d)
	$\displaystyle\hskip 113.81102pt\frac{1}{2}\sum_{\ell=1}^{L}\left(\frac{\hat{% \lambda}_{{\ell}}}{\lambda_{{\ell}}}-1+\log\frac{\lambda_{{\ell}}}{\hat{% \lambda}_{{\ell}}}\right)\leq P.$		(101e)

C-1 Proof of $R^{*}(D,P)\geq R(D,P)$

Let $\{\gamma_{\ell},\hat{\lambda}_{\ell}\}_{\ell=1}^{L}$ be the optimal solution of (101). For $\ell\in\{1,\ldots,L\}$ , let $\hat{Z}_{G,\ell}^{*}$ be jointly Gaussian with $Z_{\ell}$ with their covariance matrix as given in (27), and be independent of all other $Z_{\ell^{\prime}}$ , i.e., $\forall\ell^{\prime}\neq\ell$ . Let $\hat{Z}_{G}^{*}=(\hat{Z}_{G,1}^{*},\ldots,\hat{Z}_{G,L}^{*})$ . Further, set $\hat{X}^{*}_{G}=\Theta^{T}\hat{Z}^{*}_{G}$ . It can be verified that

$\displaystyle\mathbb{E}[\|X-\hat{X}^{*}_{G}\|^{2}]$	$\displaystyle=$	$\displaystyle\mathbb{E}[\|Z-\hat{Z}^{*}_{G}\|^{2}]$	(102)
	$\displaystyle=$	$\displaystyle\sum\limits_{\ell=1}^{L}\mathbb{E}[(Z_{\ell}-\hat{Z}^{*}_{G,\ell})^{2}]$	(103)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\left(\lambda_{{\ell}}-2\sqrt{\hat{\lambda}_{{\ell}}(\lambda_{{\ell}}-\gamma_{{\ell}})}+\hat{\lambda}_{{\ell}}\right)$	(104)
	$\displaystyle\leq$	$\displaystyle D,$	(105)

and

$\displaystyle D(P_{\hat{X}^{*}_{G}}\|P_{X})$	$\displaystyle=$	$\displaystyle D(P_{\hat{Z}^{*}_{G}}\|P_{Z})$	(106)
	$\displaystyle=$	$\displaystyle\sum\limits_{\ell=1}^{L}D(P_{\hat{Z}^{*}_{G,\ell}}\|P_{Z_{\ell}})$	(107)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\left(\frac{\hat{\lambda}_{{\ell}}}{\lambda_{{\ell}}}-1+\log\frac{\lambda_{{\ell}}}{\hat{\lambda}_{{\ell}}}\right)$	(108)
	$\displaystyle\leq$	$\displaystyle P,$	(109)

where (102) and (106) are due to the invariance of KL-divergence and Euclidean distance under unitary transformations. Therefore, we must have $R(D,P)\leq I(X;\hat{X}^{*}_{G})$ . On the other hand,

$\displaystyle I(X;\hat{X}^{*}_{G})$	$\displaystyle=$	$\displaystyle I(Z;\hat{Z}^{*}_{G})$	(110)
	$\displaystyle=$	$\displaystyle\sum\limits_{\ell=1}^{L}I(Z_{\ell};\hat{Z}^{*}_{G,\ell})$	(111)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum\limits_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{% \gamma_{\ell}}$	(112)
	$\displaystyle=$	$\displaystyle R^{*}(D,P).$	(113)

This proves $R^{*}(D,P)\geq R(D,P)$ .
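The per-component quantities used in this achievability argument are easy to cross-check numerically. The Python sketch below assumes that the cross-covariance $\mathbb{E}[Z_{\ell}\hat{Z}^{*}_{G,\ell}]$ equals $\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}$, as implied by (104), and uses natural logarithms. The function name and the numerical values are hypothetical.

```python
import math

def component_metrics(lam, gamma, lam_hat):
    """Rate (112), distortion (104), KL perception loss (108) and
    cross-covariance of one component, for 0 < gamma <= lam and lam_hat > 0."""
    cross = math.sqrt(lam_hat * (lam - gamma))      # E[Z * Z_hat], implied by (104)
    rate = 0.5 * math.log(lam / gamma)
    distortion = lam - 2.0 * cross + lam_hat
    perception = 0.5 * (lam_hat / lam - 1.0 + math.log(lam / lam_hat))
    return rate, distortion, perception, cross

lam, gamma, lam_hat = 2.0, 0.5, 1.2                 # hypothetical values
rate, distortion, perception, cross = component_metrics(lam, gamma, lam_hat)

# Cross-checks based on standard facts for jointly Gaussian scalars
rho_sq = cross ** 2 / (lam * lam_hat)               # squared correlation coefficient
rate_from_rho = -0.5 * math.log(1.0 - rho_sq)       # Gaussian mutual information
mmse = lam - cross ** 2 / lam_hat                   # error of the MMSE estimate of Z
```

The mutual information computed from the correlation coefficient matches $\frac{1}{2}\log(\lambda_{\ell}/\gamma_{\ell})$, and the MMSE equals $\gamma_{\ell}$, which is consistent with reading $\gamma_{\ell}$ as the conventional rate-distortion level of the component.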

C-2 Proof of $R^{*}(D,P)\leq R(D,P)$

It follows from Theorem 2 that


$\displaystyle R(D,P)=$	$\displaystyle\inf_{P_{\hat{X}_{G}|X}}$	$\displaystyle I(X;\hat{X}_{G}),$	(114a)
	s.t.	$\displaystyle\mathbb{E}[\|X-\hat{X}_{G}\|^{2}]\leq D,$	(114b)
		$\displaystyle D(P_{\hat{X}_{G}}\|P_{X})\leq P,$	(114c)

where $\hat{X}_{G}$ has mean zero and is jointly Gaussian with $X$ . Let $P_{\hat{X}_{G}^{*}|X}$ be the optimal distribution of the program in (114) and define $\hat{Z}^{*}_{G}=\Theta\hat{X}^{*}_{G}$ . Let $\Sigma_{\hat{X}^{*}_{G}}$ be the covariance matrix of $\hat{X}^{*}_{G}$ and $\Lambda_{\hat{Z}^{*}_{G}}$ be a diagonal matrix whose diagonal elements coincide with those of $\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}$ , i.e.,

\displaystyle\Lambda_{\hat{Z}^{*}_{G}}=\text{diag}^{L}(\hat{\lambda}_{1},% \ldots,\hat{\lambda}_{L}).

(115)

Furthermore, define

\displaystyle\gamma_{\ell}=\mathbbm{E}[(Z_{\ell}-\mathbb{E}[Z_{\ell}|\hat{Z}^{% *}_{G,\ell}])^{2}],\qquad\ell\in\{1,\ldots,L\}.

(116)

Clearly, (101b) and (101c) are satisfied.

It can be verified that

$\displaystyle I(X;\hat{X}^{*}_{G})$	$\displaystyle=$	$\displaystyle I(Z;\hat{Z}^{*}_{G})$	(117)
	$\displaystyle=$	$\displaystyle h(Z)-h(Z|\hat{Z}^{*}_{G})$	(118)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}h(Z_{\ell})-h(Z|\hat{Z}^{*}_{G})$	(119)
	$\displaystyle\geq$	$\displaystyle\sum_{\ell=1}^{L}h(Z_{\ell})-\sum_{\ell=1}^{L}h(Z_{\ell}|\hat{Z}^{*}_{G,\ell})$	(120)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}h(Z_{\ell})-\sum_{\ell=1}^{L}h(Z_{\ell}-\mathbbm{E}[Z_{\ell}|\hat{Z}^{*}_{G,\ell}]\,|\,\hat{Z}^{*}_{G,\ell})$	(121)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}h(Z_{\ell})-\sum_{\ell=1}^{L}h(Z_{\ell}-\mathbbm{E}[Z_{\ell}|\hat{Z}^{*}_{G,\ell}])$	(122)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\frac{1}{2}\log\left((2\pi e)\lambda_{{\ell}}\right)-\sum_{\ell=1}^{L}\frac{1}{2}\log\left((2\pi e)\gamma_{{\ell}}\right)$	(123)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\frac{1}{2}\log\frac{\lambda_{{\ell}}}{\gamma_{{\ell}}},$	(124)

where

• (117) is due to the invertibility of unitary transformations,
• (119) follows because $Z_{1},\ldots,Z_{L}$ are independent,
• (120) follows from the chain rule and that conditioning does not increase entropy,
• (122) follows because $Z_{\ell}-\mathbbm{E}[Z_{\ell}|\hat{Z}^{*}_{G,\ell}]$ is independent of $\hat{Z}^{*}_{G,\ell}$,
• (123) follows because $\mathbb{E}[Z_{\ell}^{2}]=\lambda_{{\ell}}$ and $\mathbb{E}[(Z_{\ell}-\mathbb{E}[Z_{\ell}|\hat{Z}^{*}_{G,\ell}])^{2}]=\gamma_{{\ell}}$.

Next, consider the expected distortion loss as follows:

$\displaystyle D\geq\mathbb{E}[\|X-\hat{X}^{*}_{G}\|^{2}]$	$\displaystyle=$	$\displaystyle\mathbb{E}[\|Z-\hat{Z}^{*}_{G}\|^{2}]$	(125)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\mathbb{E}[(Z_{\ell}-\hat{Z}^{*}_{G,\ell})^{2}]$	(126)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\left(\mathbb{E}[Z_{\ell}^{2}]-2\mathbb{E}[Z_{\ell}\hat{Z}^{*}_{G,\ell}]+\mathbb{E}[(\hat{Z}^{*}_{G,\ell})^{2}]\right)$	(127)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\left(\lambda_{{\ell}}-2\mathbb{E}[Z_{\ell}\hat{Z}^{*}_{G,\ell}]+\hat{\lambda}_{{\ell}}\right)$	(128)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\left(\lambda_{{\ell}}-2\sqrt{\hat{\lambda}_{{\ell}}(\lambda_{{\ell}}-\gamma_{{\ell}})}+\hat{\lambda}_{{\ell}}\right),$	(129)

where

• (125) is due to the invariance of Euclidean distance under unitary transformations,
• (128) follows because $\mathbb{E}[Z_{\ell}^{2}]=\lambda_{{\ell}}$ and $\mathbb{E}[(\hat{Z}^{*}_{G,\ell})^{2}]=\hat{\lambda}_{{\ell}}$,
• (129) follows from the identity $\mathbb{E}[(Z_{\ell}-\mathbb{E}[Z_{\ell}|\hat{Z}^{*}_{G,\ell}])^{2}]=\mathbb{E}[Z_{\ell}^{2}]-(\mathbb{E}[Z_{\ell}\hat{Z}^{*}_{G,\ell}])^{2}(\mathbb{E}[(\hat{Z}^{*}_{G,\ell})^{2}])^{-1}$, together with $\mathbb{E}[(Z_{\ell}-\mathbb{E}[Z_{\ell}|\hat{Z}^{*}_{G,\ell}])^{2}]=\gamma_{\ell}$, $\mathbb{E}[Z_{\ell}^{2}]=\lambda_{{\ell}}$, and $\mathbb{E}[(\hat{Z}^{*}_{G,\ell})^{2}]=\hat{\lambda}_{{\ell}}$.

Finally, consider the perception loss:

$\displaystyle P\geq D(P_{\hat{X}^{*}_{G}}\|P_{X})$	$\displaystyle=$	$\displaystyle\frac{1}{2}\left(\text{tr}(\Lambda_{X}^{-1}\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T})-L+\log\frac{\det(\Lambda_{X})}{\det(\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T})}\right)$	(130)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\left(\text{tr}(\Lambda_{X}^{-1}\Lambda_{\hat{Z}^{*}_{G}})-L+\log\frac{\det(\Lambda_{X})}{\det(\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T})}\right)$	(131)
	$\displaystyle\geq$	$\displaystyle\frac{1}{2}\left(\text{tr}(\Lambda_{X}^{-1}\Lambda_{\hat{Z}^{*}_{G}})-L+\log\frac{\det(\Lambda_{X})}{\det(\Lambda_{\hat{Z}^{*}_{G}})}\right)$	(132)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\left(\frac{\hat{\lambda}_{{\ell}}}{\lambda_{{\ell}}}-1+\log\frac{\lambda_{{\ell}}}{\hat{\lambda}_{{\ell}}}\right),$	(133)

where

• (131) follows because $\Lambda_{X}^{-1}$ is a diagonal matrix and thus the trace depends on the diagonal elements of $\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}$ which are equal to the diagonal elements of $\Lambda_{\hat{Z}^{*}_{G}}$,
• (132) follows from Hadamard’s inequality for a positive semidefinite matrix.

Combining (124), (129), and (133) yields $R^{*}(D,P)\leq R(D,P)$ .
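The step from (131) to (132) says that discarding the off-diagonal entries of the rotated reconstruction covariance can only reduce the KL divergence to the source. The following Python sketch illustrates this for a hypothetical $2\times 2$ example with natural logarithms; the matrix `A` stands in for $\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}$, and all numerical values are made up for illustration.

```python
import math

def kl_zero_mean_2d(A, lam):
    """D(N(0, A) || N(0, diag(lam))) for a 2x2 covariance A, following (130)."""
    (a, b), (_, c) = A
    det_a = a * c - b * b                       # determinant of A
    trace_term = a / lam[0] + c / lam[1]        # tr(Lambda_X^{-1} A)
    return 0.5 * (trace_term - 2.0 + math.log(lam[0] * lam[1] / det_a))

lam = (2.0, 0.5)                                # hypothetical source eigenvalues
A = ((1.5, 0.4), (0.4, 0.6))                    # hypothetical rotated covariance
A_diag = ((A[0][0], 0.0), (0.0, A[1][1]))       # same diagonal, no correlation

kl_full = kl_zero_mean_2d(A, lam)               # corresponds to (130)/(131)
kl_diag = kl_zero_mean_2d(A_diag, lam)          # corresponds to (132)
per_component = sum(0.5 * (v / l - 1.0 + math.log(l / v))
                    for v, l in zip((A[0][0], A[1][1]), lam))   # (133)
```

Here `kl_full` is at least `kl_diag` because the determinant of `A` is at most the product of its diagonal entries, and `kl_diag` coincides with the per-component sum in (133).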

Appendix D Proof of Theorem 4

First, we show that the optimization problem in (101) is convex. The second derivative of the objective function (101a) with respect to $\gamma_{\ell}$ is $\frac{1}{2\gamma_{\ell}^{2}}$ which is positive. The second derivative of the function in the constraint (101e) with respect to $\hat{\lambda}_{\ell}$ is $\frac{1}{2\hat{\lambda}_{\ell}^{2}}$ which is again positive. It just remains to study the constraint (101d). The Hessian matrix of the function in this constraint is

\displaystyle\begin{bmatrix}\frac{\sqrt{\lambda_{\ell}-\gamma_{\ell}}}{2\sqrt{% \hat{\lambda}^{3}_{\ell}}}&\frac{1}{2\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}% -\gamma_{\ell})}}\\ \frac{1}{2\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}}&\frac{% \sqrt{\hat{\lambda}_{\ell}}}{2\sqrt{(\lambda_{\ell}-\gamma_{\ell})^{3}}}\end{% bmatrix}.

(134)

The determinant of the above matrix is zero, and the matrix has positive diagonal terms. Thus, it is a positive semidefinite matrix, which implies the convexity of the associated function. This proves the convexity of the program in (101).
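The claims about (134) can be checked numerically at any interior point. The Python sketch below evaluates the entries of (134), with the variables ordered as $(\hat{\lambda}_{\ell},\gamma_{\ell})$, and compares them with central finite differences of the per-component distortion term in (101d). The evaluation point and the step size are arbitrary choices made here.

```python
import math

def distortion_term(lam, gamma, lam_hat):
    """Per-component summand of the distortion constraint (101d)."""
    return lam - 2.0 * math.sqrt(lam_hat * (lam - gamma)) + lam_hat

def hessian_134(lam, gamma, lam_hat):
    """Entries (h11, h12, h22) of (134), variables ordered (lam_hat, gamma)."""
    s = lam - gamma
    h11 = math.sqrt(s) / (2.0 * math.sqrt(lam_hat ** 3))
    h12 = 1.0 / (2.0 * math.sqrt(lam_hat * s))
    h22 = math.sqrt(lam_hat) / (2.0 * math.sqrt(s ** 3))
    return h11, h12, h22

lam, gamma, lam_hat, step = 2.0, 0.5, 1.2, 1e-3    # hypothetical interior point
h11, h12, h22 = hessian_134(lam, gamma, lam_hat)
det = h11 * h22 - h12 * h12                         # should vanish
trace = h11 + h22                                   # the only nonzero eigenvalue

# Central finite differences of the distortion term
f0 = distortion_term(lam, gamma, lam_hat)
fd11 = (distortion_term(lam, gamma, lam_hat + step) - 2.0 * f0
        + distortion_term(lam, gamma, lam_hat - step)) / step ** 2
fd22 = (distortion_term(lam, gamma + step, lam_hat) - 2.0 * f0
        + distortion_term(lam, gamma - step, lam_hat)) / step ** 2
fd12 = (distortion_term(lam, gamma + step, lam_hat + step)
        - distortion_term(lam, gamma - step, lam_hat + step)
        - distortion_term(lam, gamma + step, lam_hat - step)
        + distortion_term(lam, gamma - step, lam_hat - step)) / (4.0 * step ** 2)
```

For a symmetric $2\times 2$ matrix, a zero determinant together with a positive trace means that the eigenvalues are $0$ and the trace, which is the positive semidefiniteness used in the text.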

Since $(D,P)$ is assumed to be strictly feasible, Slater’s condition is satisfied. This implies that the optimal value of this problem is equal to that of the following dual optimization problem

		$\displaystyle\max_{\nu_{1},\nu_{2},\eta_{\ell},\xi_{\ell}\geq 0}\;\;\min_{\{\gamma_{\ell},\hat{\lambda}_{\ell}\}_{\ell=1}^{L}}$	$\displaystyle\;\;\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma_{{\ell}}}+\nu_{1}\left(\sum_{\ell=1}^{L}(\lambda_{{\ell}}-2\sqrt{\hat{\lambda}_{{\ell}}(\lambda_{{\ell}}-\gamma_{{\ell}})}+\hat{\lambda}_{{\ell}})-D\right)$
			$\displaystyle\quad+\nu_{2}\left(\frac{1}{2}\sum_{\ell=1}^{L}\left(\frac{\hat{\lambda}_{{\ell}}}{\lambda_{{\ell}}}-1+\log\frac{\lambda_{{\ell}}}{\hat{\lambda}_{{\ell}}}\right)-P\right)+\sum_{\ell=1}^{L}\xi_{\ell}(\gamma_{\ell}-\lambda_{\ell})-\sum_{\ell=1}^{L}\eta_{\ell}\hat{\lambda}_{\ell},$		(135)

where $\{\nu_{1},\nu_{2}\}$ and $\{\xi_{\ell},\eta_{\ell}\}_{\ell=1}^{L}$ are nonnegative Lagrange multipliers. Note that the distortion function has implicit constraints $\hat{\lambda}_{\ell}\geq 0$ and $\gamma_{\ell}\leq\lambda_{\ell}$ . Moreover, the derivatives of the respective terms go to infinity when $\hat{\lambda}_{\ell}$ and $\gamma_{\ell}$ approach these boundaries. For this reason, we cannot immediately write down the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem, and instead, need to carefully consider the behaviour of the optimization problem close to these boundaries. Toward this end, we consider the following three different cases.

D-1 Case Where the Maximum for the Outer Optimization Occurs at $\nu_{1},\nu_{2}>0$

This is the case where both perception and distortion constraints are active. Let $\hat{\lambda}^{*}_{\ell}$ and $\gamma_{\ell}^{*}$ be the optimal solution to the inner minimization problem in the dual problem (135) for the optimal $\nu_{1}$ and $\nu_{2}$. We first note that

\displaystyle\hat{\lambda}^{*}_{\ell}>0.

(136)

This is because if $\hat{\lambda}^{*}_{\ell}=0$, then the perception loss in (101e) would be infinite, which would violate the perception constraint.

Next, we show that the following strict inequality holds:

\displaystyle\gamma_{\ell}^{*}<\lambda_{\ell}.

(137)

Suppose that the above strict inequality does not hold, i.e., $\gamma_{\ell}^{*}=\lambda_{\ell}$ . We show that such $\gamma_{\ell}^{*}$ cannot be the optimal solution to the inner minimization problem.

The Lagrangian in the dual problem (135) depends on $\gamma_{\ell}$ and $\hat{\lambda}_{\ell}$ through the following function:

	$\displaystyle G_{\ell}(\gamma_{\ell},\hat{\lambda}_{\ell})$	$\displaystyle=$	$\displaystyle\frac{1}{2}\log\frac{\lambda_{\ell}}{\gamma_{\ell}}+\nu_{1}\left(\lambda_{\ell}-2\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}+\hat{\lambda}_{\ell}\right)+\frac{\nu_{2}}{2}\left(\frac{\hat{\lambda}_{\ell}}{\lambda_{\ell}}-1+\log\frac{\lambda_{\ell}}{\hat{\lambda}_{\ell}}\right)$
			$\displaystyle\hskip 28.45274pt+\xi_{\ell}(\gamma_{\ell}-\lambda_{\ell})-\eta_{\ell}\hat{\lambda}_{\ell}.$		(138)

Fix $\hat{\lambda}_{\ell}=\hat{\lambda}_{\ell}^{*}$ . When we deviate from $\gamma_{\ell}^{*}=\lambda_{\ell}$ to $\gamma_{\ell}^{\prime}=\lambda_{\ell}-\epsilon$ for some small $\epsilon>0$ , the first order change in $G_{\ell}(\gamma_{\ell},\hat{\lambda}_{\ell}^{*})$ can be seen as follows:

$\displaystyle G_{\ell}(\gamma^{*}_{\ell},\hat{\lambda}^{*}_{\ell})-G_{\ell}(\gamma^{\prime}_{\ell},\hat{\lambda}^{*}_{\ell})$	$\displaystyle=$	$\displaystyle\frac{1}{2}\log\frac{\lambda_{\ell}-\epsilon}{\lambda_{\ell}}+2\nu_{1}\sqrt{\epsilon\hat{\lambda}^{*}_{{\ell}}}+\epsilon\xi_{\ell}$	(139)
	$\displaystyle=$	$\displaystyle-\frac{\epsilon}{2\lambda_{\ell}}+2\nu_{1}\sqrt{\epsilon\hat{\lambda}^{*}_{{\ell}}}+\epsilon\xi_{\ell}+O(\epsilon^{2})$	(140)
	$\displaystyle=$	$\displaystyle 2\nu_{1}\sqrt{\epsilon\hat{\lambda}^{*}_{{\ell}}}+O(\epsilon),$	(141)

where we use the fact that $\log(1-x)=-x+O(x^{2})$ for small $x$ . Thus if $\nu_{1}>0$ , since $\hat{\lambda}_{\ell}^{*}>0$ , for sufficiently small $\epsilon>0$ , we can strictly decrease $G_{\ell}(\gamma^{*}_{\ell},\hat{\lambda}^{*}_{\ell})$ , while satisfying the implicit constraints. This contradicts the assumption that $\gamma^{*}_{\ell}=\lambda_{\ell}$ is the optimal solution to the inner minimization problem. This proves (137), which implies that every component has positive rate.
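The square-root scaling in (141) can be checked numerically. The sketch below evaluates $G_{\ell}$ from (138) at $\gamma_{\ell}=\lambda_{\ell}$ and at $\gamma_{\ell}=\lambda_{\ell}-\epsilon$ and compares the difference with the leading term $2\nu_{1}\sqrt{\epsilon\hat{\lambda}^{*}_{\ell}}$. The numerical values of $\lambda_{\ell}$, $\hat{\lambda}^{*}_{\ell}$, $\nu_{1}$, $\nu_{2}$ are illustrative assumptions, as is the choice $\xi_{\ell}=\eta_{\ell}=0$; the function names are hypothetical.

```python
import math

def G(gamma, lam_hat, lam, nu1, nu2, xi=0.0, eta=0.0):
    """Per-component Lagrangian term G_ell from (138)."""
    rate = 0.5 * math.log(lam / gamma)
    dist = lam - 2.0 * math.sqrt(lam_hat * (lam - gamma)) + lam_hat
    perc = 0.5 * (lam_hat / lam - 1.0 + math.log(lam / lam_hat))
    return rate + nu1 * dist + nu2 * perc + xi * (gamma - lam) - eta * lam_hat

# Illustrative values (assumptions, not taken from the paper).
lam, lam_hat, nu1, nu2 = 1.0, 0.8, 0.5, 0.3

def decrease(eps):
    """G at the boundary gamma = lam minus G at the interior point gamma = lam - eps."""
    return G(lam, lam_hat, lam, nu1, nu2) - G(lam - eps, lam_hat, lam, nu1, nu2)

def leading_term(eps):
    """Leading term 2 * nu1 * sqrt(eps * lam_hat) from (141)."""
    return 2.0 * nu1 * math.sqrt(eps * lam_hat)

for eps in (1e-2, 1e-4, 1e-6):
    print(eps, decrease(eps), leading_term(eps))
```

For small $\epsilon$ the two printed quantities should agree to leading order, and the difference stays positive, which is the strict decrease used in the contradiction argument.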

The strict inequalities in (137) and (136) imply that, in this case, the optimal solution lies in the interior of the set $\{\hat{\lambda}^{*}_{\ell}\geq 0\text{ and }\gamma_{\ell}^{*}\leq\lambda_{\ell}\}$. This allows us to write down the KKT conditions for the optimal primal variables $(\gamma_{\ell}^{*},\hat{\lambda}_{\ell}^{*})$ and the optimal dual variables $\{\nu_{1},\nu_{2}\}$ and $\{\xi_{\ell},\eta_{\ell}\}_{\ell=1}^{L}$ as follows:


$\displaystyle\frac{1}{2\gamma^{*}_{\ell}}-\nu_{1}\sqrt{\frac{\hat{\lambda}^{*}_{\ell}}{\lambda_{\ell}-\gamma^{*}_{\ell}}}-\xi_{\ell}$	$\displaystyle=$	$\displaystyle 0,$	(142a)
$\displaystyle\nu_{1}\left(-\sqrt{\frac{\lambda_{\ell}-\gamma^{*}_{\ell}}{\hat{\lambda}^{*}_{\ell}}}+1\right)+\frac{1}{2}\nu_{2}\left(\frac{1}{\lambda_{\ell}}-\frac{1}{\hat{\lambda}^{*}_{\ell}}\right)-\eta_{\ell}$	$\displaystyle=$	$\displaystyle 0,$	(142b)
$\displaystyle\xi_{\ell}(\gamma^{*}_{\ell}-\lambda_{\ell})$	$\displaystyle=$	$\displaystyle 0,$	(142c)
$\displaystyle\eta_{\ell}\hat{\lambda}^{*}_{\ell}$	$\displaystyle=$	$\displaystyle 0,$	(142d)
$\displaystyle\nu_{1}\left(\sum_{\ell=1}^{L}\left(\lambda_{\ell}-2\sqrt{\hat{\lambda}^{*}_{\ell}(\lambda_{\ell}-\gamma^{*}_{\ell})}+\hat{\lambda}^{*}_{\ell}\right)-D\right)$	$\displaystyle=$	$\displaystyle 0,$	(142e)
$\displaystyle\nu_{2}\left(\sum_{\ell=1}^{L}\frac{1}{2}\left(\frac{\hat{\lambda}^{*}_{\ell}}{\lambda_{\ell}}-1+\log\frac{\lambda_{\ell}}{\hat{\lambda}^{*}_{\ell}}\right)-P\right)$	$\displaystyle=$	$\displaystyle 0,$	(142f)

along with primal and dual feasibility constraints, i.e., $\eta_{\ell}\geq 0$ , $\xi_{\ell}\geq 0$ and (34e)-(34e).

Due to the strict inequalities (137) and (136), the complementary slackness conditions (142c) and (142d) give $\xi_{\ell}=0$ and $\eta_{\ell}=0$. Then, from condition (142a), we can write $\hat{\lambda}^{*}_{\ell}$ as follows:

$\displaystyle\hat{\lambda}_{\ell}^{*}=\frac{\lambda_{\ell}-\gamma^{*}_{\ell}}{4\gamma^{*2}_{\ell}\nu_{1}^{2}}.$	(143)

Plugging (143) into (142b) yields the following second-order equation in $\gamma_{\ell}^{*}$

$\displaystyle\nu_{1}(1-2\nu_{1}\gamma^{*}_{\ell})=\frac{1}{2}\nu_{2}\left(\frac{4\gamma_{\ell}^{*2}\nu_{1}^{2}}{\lambda_{\ell}-\gamma_{\ell}^{*}}-\frac{1}{\lambda_{\ell}}\right).$	(144)

Note that as $\gamma^{*}_{\ell}$ varies from $0$ to $\lambda_{\ell}$, the left-hand side of (144) decreases monotonically from $\nu_{1}$ to $(1-2\nu_{1}\lambda_{\ell})\nu_{1}$, while the right-hand side of (144) increases monotonically from $-\frac{\nu_{2}}{2\lambda_{\ell}}$ to $+\infty$. So, this equation has a unique solution in the interval $(0,\lambda_{\ell})$. After multiplying both sides by $\lambda_{\ell}-\gamma^{*}_{\ell}$, (144) becomes a quadratic equation in $\gamma^{*}_{\ell}$, so it can be solved analytically. The solution gives (IV-C) and (37).
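As a numerical companion to the closed-form solution, the following sketch finds the root of (144) by bisection, using the monotonicity argument above, and then recovers $\hat{\lambda}^{*}_{\ell}$ from (143). It takes $\nu_{1},\nu_{2}>0$ as given; in the full problem they would still have to be tuned so that (142e) and (142f) hold. The function name and the parameter values in the usage line are illustrative assumptions.

```python
import math

def solve_kl_component(lam, nu1, nu2, iters=200):
    """Solve (144) for gamma* in (0, lam) by bisection, then get lam_hat* from (143)."""
    def gap(g):
        # Left-hand side minus right-hand side of (144):
        # positive near g = 0, tends to -infinity as g approaches lam.
        if g >= lam:
            return -math.inf
        lhs = nu1 * (1.0 - 2.0 * nu1 * g)
        rhs = 0.5 * nu2 * (4.0 * g * g * nu1 * nu1 / (lam - g) - 1.0 / lam)
        return lhs - rhs

    lo, hi = 0.0, lam
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    gamma = 0.5 * (lo + hi)
    lam_hat = (lam - gamma) / (4.0 * gamma * gamma * nu1 * nu1)  # equation (143)
    return gamma, lam_hat

# Illustrative multipliers (assumptions, not values from the paper).
print(solve_kl_component(lam=1.0, nu1=1.0, nu2=1.0))
```

Bisection is used here only because it mirrors the uniqueness argument; the quadratic formula gives the same root.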

D-2 Case Where the Maximum for the Outer Optimization Occurs at $\nu_{1}>0,\nu_{2}=0$

This is the case where the distortion metric is active but the perception metric is inactive. Clearly, this reduces to the traditional rate-distortion function.

D-3 Case Where the Maximum for the Outer Optimization Occurs at $\nu_{1}=0$

This is the case where the distortion metric is inactive, so the inner minimization problem in (LABEL:Lagrange-dual-function) decouples into two independent minimizations, one for $\gamma_{\ell}$ and the other one for $\hat{\lambda}_{\ell}$ , i.e.,

	$\displaystyle\min_{\{\gamma_{\ell},\hat{\lambda}_{\ell}\}_{\ell=1}^{L}}\;\;\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma_{\ell}}+\nu_{2}\left(\frac{1}{2}\sum_{\ell=1}^{L}\left(\frac{\hat{\lambda}_{\ell}}{\lambda_{\ell}}-1+\log\frac{\lambda_{\ell}}{\hat{\lambda}_{\ell}}\right)-P\right)+\sum_{\ell=1}^{L}\xi_{\ell}(\gamma_{\ell}-\lambda_{\ell})-\sum_{\ell=1}^{L}\eta_{\ell}\hat{\lambda}_{\ell}$
	$\displaystyle\quad=\min_{\{\gamma_{\ell}\}_{\ell=1}^{L}}\;\;\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma_{\ell}}+\sum_{\ell=1}^{L}\xi_{\ell}(\gamma_{\ell}-\lambda_{\ell})$
	$\displaystyle\qquad+\min_{\{\hat{\lambda}_{\ell}\}_{\ell=1}^{L}}\nu_{2}\left(\frac{1}{2}\sum_{\ell=1}^{L}\left(\frac{\hat{\lambda}_{\ell}}{\lambda_{\ell}}-1+\log\frac{\lambda_{\ell}}{\hat{\lambda}_{\ell}}\right)-P\right)-\sum_{\ell=1}^{L}\eta_{\ell}\hat{\lambda}_{\ell}.$	(145)

For the first optimization problem in (145), its KKT conditions are given by

	$\displaystyle\frac{1}{2\gamma^{*}_{{\ell}}}-\xi_{\ell}$	$\displaystyle=$	$\displaystyle 0,$		(146)
	$\displaystyle\xi_{\ell}(\gamma^{*}_{\ell}-\lambda_{\ell})$	$\displaystyle=$	$\displaystyle 0.$		(147)

The above two conditions imply that

$\displaystyle\gamma_{\ell}^{*}=\lambda_{\ell}.$	(148)

So each component has zero rate.

The second minimization problem in (145) is the Lagrangian dual of a feasibility problem with the perception constraint only. Thus, we can choose $\hat{\lambda}_{\ell}^{*}$ to satisfy the primal constraints:

$\displaystyle\sum_{\ell=1}^{L}P_{\ell}(\hat{\lambda}^{*}_{\ell})\leq P\quad\text{and}\quad\hat{\lambda}_{\ell}^{*}\geq 0.$	(149)

Note that although the distortion constraint is assumed to be inactive, we still need to impose the distortion constraint on $\hat{\lambda}_{\ell}^{*}$:

$\displaystyle\sum_{\ell=1}^{L}\left(\lambda_{\ell}+\hat{\lambda}^{*}_{\ell}\right)\leq D.$	(150)

This is because not all $\hat{\lambda}_{\ell}^{*}$ ’s satisfying (149) satisfy the constraint (150). A constraint being inactive simply means that if the constraint is removed, there is already at least one optimal solution that automatically satisfies the constraint. In this case, there are multiple optimal solutions, all giving the same objective value (of zero rate). So we need to restrict to the ones that satisfy (150). Note that the left-hand side of (150) is the distortion of the reconstruction at zero rate.
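The following toy check illustrates why (150) must be imposed separately: at zero rate, a reconstruction variance can satisfy the perception constraint (149) and still violate (150). It is a minimal sketch under the assumption that $P_{\ell}(\hat{\lambda}_{\ell})$ is the per-component KL term appearing in (142f); the single-component source and the values of $D$ and $P$ are illustrative assumptions.

```python
import math

def kl_perception(lam_hat, lam):
    """Per-component KL perception loss, as in the summand of (142f)."""
    return 0.5 * (lam_hat / lam - 1.0 + math.log(lam / lam_hat))

def zero_rate_check(lams, lam_hats, D, P):
    """Return (perception_ok, distortion_ok) for a zero-rate candidate (gamma = lam)."""
    perception = sum(kl_perception(h, l) for l, h in zip(lams, lam_hats))
    distortion = sum(l + h for l, h in zip(lams, lam_hats))  # left-hand side of (150)
    return perception <= P, distortion <= D

lams = [1.0]          # illustrative single-component source (assumption)
D, P = 2.2, 0.5       # illustrative constraint levels (assumption)
print(zero_rate_check(lams, [1.0], D, P))  # (True, True)
print(zero_rate_check(lams, [1.5], D, P))  # (True, False)
```

Both candidates achieve zero rate and meet the perception constraint, but only the first one is a valid optimal solution once (150) is enforced.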

Appendix E Proof of Theorem 5

We now establish the RDP function with the Wasserstein-2 distance as the perception metric. The proof follows steps similar to those for the KL-divergence metric in Appendix C; we only need to rewrite the lower bounding steps for the perception metric. Let $P_{\hat{X}^{*}_{G}|X}$ be the optimal conditional distribution for the following optimization problem:


$\displaystyle R(D,P)=$	$\displaystyle\inf_{P_{\hat{X}_{G}|X}}$	$\displaystyle I(X;\hat{X}_{G}),$	(151a)
	s.t.	$\displaystyle\mathbb{E}[\|X-\hat{X}_{G}\|^{2}]\leq D,$	(151b)
		$\displaystyle W_{2}^{2}(P_{X},P_{\hat{X}_{G}})\leq P,$	(151c)

where $\hat{X}_{G}$ has mean zero and is jointly Gaussian with $X$ . Let $\hat{Z}^{*}_{G}=\Theta\hat{X}^{*}_{G}$ and $\Sigma_{\hat{X}^{*}_{G}}$ be the covariance matrix of $\hat{X}^{*}_{G}$ and $\Lambda_{\hat{Z}^{*}_{G}}$ be a diagonal matrix whose diagonal elements coincide with those of $\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}$ , i.e.,

$\displaystyle\Lambda_{\hat{Z}^{*}_{G}}=\text{diag}^{L}(\hat{\lambda}_{1},\ldots,\hat{\lambda}_{L}).$	(152)

The lower bounding steps for the perception metric are as follows

$\displaystyle W_{2}^{2}(P_{X},P_{\hat{X}^{*}_{G}})$	$\displaystyle=$	$\displaystyle\text{tr}(\Sigma_{X}+\Sigma_{\hat{X}^{*}_{G}}-2(\Sigma_{X}^{\frac{1}{2}}\Sigma_{\hat{X}^{*}_{G}}\Sigma_{X}^{\frac{1}{2}})^{\frac{1}{2}})$	(153)
	$\displaystyle=$	$\displaystyle\text{tr}(\Theta\Sigma_{X}\Theta^{T}+\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}-2\Theta(\Sigma_{X}^{\frac{1}{2}}\Sigma_{\hat{X}^{*}_{G}}\Sigma_{X}^{\frac{1}{2}})^{\frac{1}{2}}\Theta^{T})$	(154)
	$\displaystyle=$	$\displaystyle\text{tr}(\Theta\Sigma_{X}\Theta^{T}+\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}-2(\Theta\Sigma_{X}^{\frac{1}{2}}\Sigma_{\hat{X}^{*}_{G}}\Sigma_{X}^{\frac{1}{2}}\Theta^{T})^{\frac{1}{2}})$	(155)
	$\displaystyle=$	$\displaystyle\text{tr}(\Theta\Sigma_{X}\Theta^{T}+\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}-2(\Theta\Sigma_{X}^{\frac{1}{2}}\Theta^{T}\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}\Theta\Sigma_{X}^{\frac{1}{2}}\Theta^{T})^{\frac{1}{2}})$	(156)
	$\displaystyle=$	$\displaystyle\text{tr}(\Theta\Sigma_{X}\Theta^{T}+\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}-2((\Theta\Sigma_{X}\Theta^{T})^{\frac{1}{2}}\Theta\Sigma_{\hat{X}^{*}_{G}}\Theta^{T}(\Theta\Sigma_{X}\Theta^{T})^{\frac{1}{2}})^{\frac{1}{2}})$	(157)
	$\displaystyle=$	$\displaystyle W_{2}^{2}(P_{\Theta X},P_{\Theta\hat{X}^{*}_{G}})$	(158)
	$\displaystyle=$	$\displaystyle W_{2}^{2}(P_{Z},P_{\hat{Z}^{*}_{G}})$	(159)
	$\displaystyle\geq$	$\displaystyle\sum_{\ell=1}^{L}W_{2}^{2}(P_{Z_{\ell}},P_{\hat{Z}^{*}_{G,\ell}})$	(160)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\left(\sqrt{\mathbb{E}[(Z_{\ell})^{2}]}-\sqrt{\mathbb{E}[(\hat{Z}^{*}_{G,\ell})^{2}]}\right)^{2}$	(161)
	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\left(\sqrt{\lambda_{\ell}}-\sqrt{\hat{\lambda}_{\ell}}\right)^{2},$	(162)

where

• (154) follows because the trace is invariant under unitary transformations;

• (155) and (157) follow because for a given matrix $A$, $(\Theta A\Theta^{T})^{\frac{1}{2}}=\Theta A^{\frac{1}{2}}\Theta^{T}$ since $\Theta$ is a unitary matrix;

• (156) follows because $\Theta^{T}\Theta=I$;

• (159) follows from the definitions $Z=\Theta X$ and $\hat{Z}^{*}_{G}=\Theta\hat{X}^{*}_{G}$;

• (160) follows from the tensorization property of the Wasserstein-2 distance, i.e., for given distributions $P_{X_{1}X_{2}}$ and $P_{Y_{1}Y_{2}}$, we have $W_{2}^{2}(P_{X_{1}X_{2}},P_{Y_{1}Y_{2}})\geq W_{2}^{2}(P_{X_{1}},P_{Y_{1}})+W_{2}^{2}(P_{X_{2}},P_{Y_{2}})$;

• (162) follows from (2) and (152).
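The tensorization step (160) can be illustrated numerically for $L=2$. The sketch below uses the closed form (153) for zero-mean Gaussians, specialized to $2\times 2$ covariances through the identity $\text{tr}(M^{\frac{1}{2}})=\sqrt{\text{tr}(M)+2\sqrt{\det M}}$, which holds for $2\times 2$ positive semidefinite $M$. The covariance matrices and function names are illustrative assumptions; `A` plays the role of the diagonal covariance of $Z$ and `B` that of $\hat{Z}^{*}_{G}$.

```python
import math

def w2_squared_gaussian_2d(A, B):
    """Squared Wasserstein-2 distance between N(0, A) and N(0, B), 2x2 covariances.

    Uses tr((A^{1/2} B A^{1/2})^{1/2}) = sqrt(tr(AB) + 2 sqrt(det(A) det(B))),
    valid for 2x2 positive semidefinite matrices.
    """
    tr_a = A[0][0] + A[1][1]
    tr_b = B[0][0] + B[1][1]
    tr_ab = sum(A[i][k] * B[k][i] for i in range(2) for k in range(2))
    det_a = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    det_b = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    cross = math.sqrt(tr_ab + 2.0 * math.sqrt(det_a * det_b))
    return tr_a + tr_b - 2.0 * cross

def marginal_sum(A, B):
    """Sum of per-component squared W2 distances, as in (161)-(162)."""
    return sum((math.sqrt(A[i][i]) - math.sqrt(B[i][i])) ** 2 for i in range(2))

A = [[4.0, 0.0], [0.0, 1.0]]            # diagonal source covariance (assumption)
B_diag = [[1.0, 0.0], [0.0, 0.25]]      # diagonal reconstruction covariance
B_corr = [[2.0, 0.5], [0.5, 1.0]]       # correlated reconstruction covariance

print(w2_squared_gaussian_2d(A, B_diag), marginal_sum(A, B_diag))  # equal
print(w2_squared_gaussian_2d(A, B_corr), marginal_sum(A, B_corr))  # joint >= marginals
```

With a diagonal reconstruction covariance the two quantities coincide, while a correlated one makes the joint distance strictly larger in this example.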

On the other hand, the inequality in (160) becomes an equality if $\hat{X}^{*}_{G}=\Theta^{T}\hat{Z}^{*}_{G}$ with $\hat{Z}^{*}_{G}$ constructed in such a way that the pairs $(Z_{\ell},\hat{Z}^{*}_{G,\ell})$, $\ell\in\{1,\ldots,L\}$, are mutually independent and their covariance matrices are given by (27). Thus, the RDP function with the Wasserstein-2 distance as the perception metric is given by the following optimization problem:


	$\displaystyle R(D,P)=\min_{\{\hat{\lambda}_{\ell},\gamma_{\ell}\}_{\ell=1}^{L}}\;\;\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma_{\ell}}$	(163a)
	s.t.	$\displaystyle 0<\gamma_{\ell}\leq\lambda_{\ell},$	(163b)
		$\displaystyle 0\leq\hat{\lambda}_{\ell},$	(163c)
		$\displaystyle\sum_{\ell=1}^{L}\left(\lambda_{\ell}-2\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}+\hat{\lambda}_{\ell}\right)\leq D,$	(163d)
		$\displaystyle\sum_{\ell=1}^{L}\left(\sqrt{\lambda_{\ell}}-\sqrt{\hat{\lambda}_{\ell}}\right)^{2}\leq P.$	(163e)

Appendix F Proof of Theorem 6

First, note that the optimization problem is convex for the Wasserstein-2 distance as justified below. The argument for the rate and distortion constraints is the same as the KL-divergence metric. The second derivative of the perception constraint in (163e) with respect to $\hat{\lambda}_{\ell}$ is $\frac{1}{2}\sqrt{\frac{\lambda_{\ell}}{\hat{\lambda}^{3}_{\ell}}}$ , which is positive.

The optimization problem can be analyzed in the same way as in Appendix D, except the case of $\nu_{1},\nu_{2}>0$ , which is discussed as follows. Here, we need a different proof to show the inequality

$\displaystyle\hat{\lambda}^{*}_{\ell}>0.$	(164)

(The proof uses the same technique as the one showing $\gamma_{\ell}^{*}<\lambda_{\ell}$ in Appendix D-1.) Consider the following Lagrange dual optimization

		$\displaystyle\max_{\nu_{1},\nu_{2},\eta_{\ell},\xi_{\ell}\geq 0}\;\;\min_{\{\gamma_{\ell},\hat{\lambda}_{\ell}\}_{\ell=1}^{L}}$	$\displaystyle\;\;\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma_{\ell}}+\nu_{1}\left(\sum_{\ell=1}^{L}\left(\lambda_{\ell}-2\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}+\hat{\lambda}_{\ell}\right)-D\right)$
			$\displaystyle+\nu_{2}\left(\sum_{\ell=1}^{L}\left(\sqrt{\lambda_{\ell}}-\sqrt{\hat{\lambda}_{\ell}}\right)^{2}-P\right)+\sum_{\ell=1}^{L}\xi_{\ell}(\gamma_{\ell}-\lambda_{\ell})-\sum_{\ell=1}^{L}\eta_{\ell}\hat{\lambda}_{\ell}.$	(165)

Suppose that the strict inequality in (164) does not hold, i.e., $\hat{\lambda}_{\ell}^{*}=0$ . We show that such $\hat{\lambda}_{\ell}^{*}$ cannot be the optimal solution to the inner minimization problem.

The Lagrangian term in (165) depends on $\gamma_{\ell}$ and $\hat{\lambda}_{\ell}$ through the following function:

	$\displaystyle G^{\prime}_{\ell}(\gamma_{\ell},\hat{\lambda}_{\ell})$	$\displaystyle=$	$\displaystyle\frac{1}{2}\log\frac{\lambda_{\ell}}{\gamma_{\ell}}+\nu_{1}\left(\lambda_{\ell}-2\sqrt{\hat{\lambda}_{\ell}(\lambda_{\ell}-\gamma_{\ell})}+\hat{\lambda}_{\ell}\right)+\nu_{2}\left(\sqrt{\lambda_{\ell}}-\sqrt{\hat{\lambda}_{\ell}}\right)^{2}$
			$\displaystyle\quad+\xi_{\ell}(\gamma_{\ell}-\lambda_{\ell})-\eta_{\ell}\hat{\lambda}_{\ell}.$	(166)

We fix $\gamma_{\ell}=\gamma_{\ell}^{*}$ and then deviate from $\hat{\lambda}_{\ell}^{*}=0$ to $\hat{\lambda}_{\ell}^{\prime}=\epsilon$ for some small $\epsilon>0$ . The first order change in $G^{\prime}_{\ell}(\gamma^{*}_{\ell},\hat{\lambda}_{\ell})$ can be seen as follows:

	$\displaystyle G^{\prime}_{\ell}(\gamma^{*}_{\ell},\hat{\lambda}^{*}_{\ell})-G^{\prime}_{\ell}(\gamma^{*}_{\ell},\hat{\lambda}^{\prime}_{\ell})$	$\displaystyle=$	$\displaystyle\nu_{1}\left(2\sqrt{\epsilon(\lambda_{\ell}-\gamma_{\ell}^{*})}-\epsilon\right)+\nu_{2}\left(2\sqrt{\lambda_{\ell}\epsilon}-\epsilon\right)+\eta_{\ell}\epsilon$	(167)
		$\displaystyle=$	$\displaystyle 2\left(\nu_{2}\sqrt{\lambda_{\ell}}+\nu_{1}\sqrt{\lambda_{\ell}-\gamma^{*}_{\ell}}\right)\sqrt{\epsilon}+O(\epsilon).$	(168)

Thus, if $\nu_{2}>0$, for sufficiently small $\epsilon>0$, we can strictly decrease $G^{\prime}_{\ell}(\gamma^{*}_{\ell},\hat{\lambda}^{*}_{\ell})$, while satisfying the implicit constraints. This contradicts the assumption that $\hat{\lambda}^{*}_{\ell}=0$ is the optimal solution to the inner minimization problem. This proves (164). Given the strict inequality in (164), similar to the KL-divergence metric, we can show that

$\displaystyle\gamma_{\ell}^{*}<\lambda_{\ell}.$	(169)

The strict inequalities in (169) and (164) imply that each component has a positive rate, and further $\xi_{\ell}=\eta_{\ell}=0$ . Thus, we can write down the following KKT conditions


$\displaystyle\frac{1}{2\gamma^{*}_{\ell}}-\nu_{1}\sqrt{\frac{\hat{\lambda}^{*}_{\ell}}{\lambda_{\ell}-\gamma^{*}_{\ell}}}$	$\displaystyle=$	$\displaystyle 0,$	(170a)
$\displaystyle\nu_{1}\left(-\sqrt{\frac{\lambda_{\ell}-\gamma^{*}_{\ell}}{\hat{\lambda}^{*}_{\ell}}}+1\right)+\nu_{2}\left(1-\sqrt{\frac{\lambda_{\ell}}{\hat{\lambda}^{*}_{\ell}}}\right)$	$\displaystyle=$	$\displaystyle 0,$	(170b)
$\displaystyle\sum_{\ell=1}^{L}\left(\lambda_{\ell}-2\sqrt{\hat{\lambda}^{*}_{\ell}(\lambda_{\ell}-\gamma^{*}_{\ell})}+\hat{\lambda}^{*}_{\ell}\right)$	$\displaystyle=$	$\displaystyle D,$	(170c)
$\displaystyle\sum_{\ell=1}^{L}\left(\sqrt{\lambda_{\ell}}-\sqrt{\hat{\lambda}^{*}_{\ell}}\right)^{2}$	$\displaystyle=$	$\displaystyle P.$	(170d)

The derivation of the optimal solution can now be shown as follows. Define

$\displaystyle\theta_{\ell}=\sqrt{\frac{\lambda_{\ell}-\gamma^{*}_{\ell}}{\hat{\lambda}^{*}_{\ell}}}.$	(171)

Plugging the above definition into (170b) yields

$\displaystyle\hat{\lambda}^{*}_{\ell}=\frac{\lambda_{\ell}}{\left(1+\frac{(1-\theta_{\ell})\nu_{1}}{\nu_{2}}\right)^{2}}.$	(172)

Also, from (170a), we get

$\displaystyle\gamma^{*}_{\ell}=\frac{\theta_{\ell}}{2\nu_{1}}.$	(173)

Plugging (172) and (173) into (171), we get the following equation:

$\displaystyle\frac{\theta_{\ell}}{1+\frac{(1-\theta_{\ell})\nu_{1}}{\nu_{2}}}=\sqrt{1-\frac{\theta_{\ell}}{2\nu_{1}\lambda_{\ell}}}.$	(174)

Note that the function $\frac{\theta_{\ell}}{1+\frac{(1-\theta_{\ell})\nu_{1}}{\nu_{2}}}$ is increasing in $\theta_{\ell}$ as long as its denominator is positive. Also, the function $\sqrt{1-\frac{\theta_{\ell}}{2\nu_{1}\lambda_{\ell}}}$, defined for $\theta_{\ell}\in[0,2\nu_{1}\lambda_{\ell}]$, is decreasing in $\theta_{\ell}$. So, the solution to the above equation is unique.

Thus, $\hat{\lambda}^{*}_{\ell}$ and $\gamma_{\ell}^{*}$ in (172) and (173) can be obtained from $\theta_{\ell}$ , which is determined via (174). This proves (51) and (52).
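The procedure above can be sketched numerically: solve (174) for $\theta_{\ell}$ by bisection, then evaluate (172) and (173). This is a minimal sketch that takes $\nu_{1},\nu_{2}>0$ as given; in the full problem they would still have to be chosen so that (170c) and (170d) hold. The function name, the search interval and the parameter values in the usage line are assumptions made for illustration.

```python
import math

def solve_w2_component(lam, nu1, nu2, iters=200):
    """Solve (174) for theta by bisection, then evaluate (172) and (173)."""
    # Keep both sides of (174) well defined: theta <= 2*nu1*lam (square root)
    # and theta <= 1 + nu2/nu1 (nonnegative denominator on the left-hand side).
    upper = min(2.0 * nu1 * lam, 1.0 + nu2 / nu1)

    def gap(theta):
        # (174) with the denominator cleared; negative at 0, positive at `upper`,
        # and increasing in between.
        denom = 1.0 + (1.0 - theta) * nu1 / nu2
        root = math.sqrt(max(0.0, 1.0 - theta / (2.0 * nu1 * lam)))
        return theta - denom * root

    lo, hi = 0.0, upper
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    lam_hat = lam / (1.0 + (1.0 - theta) * nu1 / nu2) ** 2  # equation (172)
    gamma = theta / (2.0 * nu1)                               # equation (173)
    return theta, gamma, lam_hat

# Illustrative multipliers (assumptions, not values from the paper).
print(solve_w2_component(lam=1.0, nu1=1.0, nu2=1.0))
```

The returned pair can be substituted back into (170a) and (171) as a consistency check.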

Appendix G Proof of Corollary 1

If $P=0$ , this falls under the first case in Theorem 4 and Theorem 6. Here, we have

$\displaystyle R(D,0)=\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma^{*}_{\ell}(D,0)}.$	(175)

The perception constraints (45) and (55) with $P=0$ imply that $\hat{\lambda}^{*}_{\ell}(D,0)=\lambda_{\ell}$ for every $\ell\in\{1,\ldots,L\}$. Now, using the expression for the optimal $\gamma_{\ell}^{*}$ in (40) together with $\hat{\lambda}^{*}_{\ell}=\lambda_{\ell}$, we have

$\displaystyle\gamma_{\ell}^{*}(D,0)=\frac{2\lambda_{\ell}}{1+\sqrt{1+16\nu_{1}^{2}\lambda_{\ell}^{2}}},$	(176)

where $\nu_{1}$ is chosen to satisfy the distortion constraint (44) and (54), i.e.,

$\displaystyle D=\sum_{\ell=1}^{L}\left(2\lambda_{\ell}-2\sqrt{\lambda_{\ell}(\lambda_{\ell}-\gamma_{\ell}^{*}(D,0))}\right).$	(177)

Combining the above proves the desired result.
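For concreteness, $R(D,0)$ can be evaluated numerically from (175)-(177): since the right-hand side of (177) decreases from $2\sum_{\ell}\lambda_{\ell}$ to $0$ as $\nu_{1}$ grows, $\nu_{1}$ can be found by bisection. The sketch below does this under the assumptions that $0<D<2\sum_{\ell}\lambda_{\ell}$ and that the logarithm is natural (rate in nats); the function names and eigenvalues are illustrative.

```python
import math

def gamma_perfect(lam, nu1):
    """Water-level gamma*(D, 0) from (176)."""
    return 2.0 * lam / (1.0 + math.sqrt(1.0 + 16.0 * nu1 * nu1 * lam * lam))

def distortion_perfect(lams, nu1):
    """Right-hand side of (177) as a function of nu1."""
    return sum(
        2.0 * l - 2.0 * math.sqrt(l * max(0.0, l - gamma_perfect(l, nu1)))
        for l in lams
    )

def rate_perfect(lams, D, iters=200):
    """Return (R(D, 0), nu1) using (175), for 0 < D < 2 * sum(lams)."""
    lo, hi = 0.0, 1.0
    while distortion_perfect(lams, hi) > D:  # distortion decreases as nu1 grows
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if distortion_perfect(lams, mid) > D:
            lo = mid
        else:
            hi = mid
    nu1 = 0.5 * (lo + hi)
    rate = 0.5 * sum(math.log(l / gamma_perfect(l, nu1)) for l in lams)
    return rate, nu1

lams = [1.0, 0.5]  # illustrative eigenvalues (assumption)
print(rate_perfect(lams, D=1.0))
```

Every component receives a positive rate for any $D<2\sum_{\ell}\lambda_{\ell}$, since $\gamma_{\ell}^{*}(D,0)<\lambda_{\ell}$ whenever $\nu_{1}>0$.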

Appendix H Asymptotic Analysis for Perceptually Perfect Reconstruction

We utilize the optimal solution for the perceptually perfect reconstruction case in Corollary 1, i.e., (175), (176) and (177).

H-1 High-Distortion Compression

Let $D=\left(\sum_{\ell=1}^{L}2\lambda_{\ell}\right)-\epsilon$ for some small $\epsilon>0$ . Note that by (177), this means that we are setting $\epsilon$ to be

$\displaystyle\epsilon=\sum_{\ell=1}^{L}2\sqrt{\lambda_{\ell}(\lambda_{\ell}-\gamma^{*}_{\ell}(D,0))}.$	(178)

In this case, $\gamma_{\ell}^{*}(D,0)$ should be close to $\lambda_{\ell}$ , and the rate is close to zero. By (176), this also means that $\nu_{1}$ must be close to zero. Then, we can approximate $\gamma_{\ell}^{*}(D,0)$ as follows:

$\displaystyle\gamma_{\ell}^{*}(D,0)$	$\displaystyle=$	$\displaystyle\frac{2\lambda_{\ell}}{1+\sqrt{1+16\lambda_{\ell}^{2}\nu_{1}^{2}}}$	(179)
	$\displaystyle=$	$\displaystyle\frac{\lambda_{\ell}}{1+4\nu_{1}^{2}\lambda^{2}_{\ell}+O(\nu_{1}^{4})}$	(180)
	$\displaystyle=$	$\displaystyle\lambda_{\ell}(1-4\nu_{1}^{2}\lambda^{2}_{\ell})+O(\nu_{1}^{4}).$	(181)

Plugging the above into (178) yields

$\displaystyle\epsilon=4\nu_{1}\sum_{\ell=1}^{L}\lambda^{2}_{\ell}+O(\nu_{1}^{2}).$	(182)

The rate expression can now be approximated as follows

$\displaystyle R\left(2\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,0\right)$	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{1+\sqrt{1+16\nu_{1}^{2}\lambda^{2}_{\ell}}}{2}$	(183)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\log(1+4\nu_{1}^{2}\lambda_{\ell}^{2}+O(\nu_{1}^{4}))$	(184)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}4\nu_{1}^{2}\lambda_{\ell}^{2}+O(\nu_{1}^{4}).$	(185)

Now, using (182) and (185) to eliminate $\nu_{1}$ , we get

$\displaystyle R\left(2\sum_{\ell=1}^{L}\lambda_{\ell}-\epsilon,0\right)=\frac{\epsilon^{2}}{8\sum_{\ell=1}^{L}\lambda_{\ell}^{2}}+O(\epsilon^{3}).$	(186)

To derive the expression for the water-level, we use (182) in (181) to get

$\displaystyle\gamma^{*}_{\ell}\left(2\sum_{\ell^{\prime}=1}^{L}\lambda_{\ell^{\prime}}-\epsilon,0\right)=\lambda_{\ell}-\frac{\epsilon^{2}\lambda_{\ell}^{3}}{4\left(\sum_{\ell^{\prime}=1}^{L}\lambda_{\ell^{\prime}}^{2}\right)^{2}}+O(\epsilon^{3}),\quad\ell\in\{1,\ldots,L\}.$	(187)
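The high-distortion approximation (186) can be sanity-checked numerically by sweeping small values of $\nu_{1}$, computing the exact pair $(\epsilon,R)$ from (176), (178) and (183), and comparing $R$ with $\epsilon^{2}/(8\sum_{\ell}\lambda_{\ell}^{2})$. The eigenvalues and function names below are illustrative assumptions, and the logarithm is taken to be natural.

```python
import math

def exact_point(lams, nu1):
    """Exact (epsilon, rate) pair from (176), (178) and (183) for a given nu1."""
    eps, rate = 0.0, 0.0
    for l in lams:
        s = math.sqrt(1.0 + 16.0 * nu1 * nu1 * l * l)
        gamma = 2.0 * l / (1.0 + s)                       # equation (176)
        eps += 2.0 * math.sqrt(l * max(0.0, l - gamma))   # summand of (178)
        rate += 0.5 * math.log((1.0 + s) / 2.0)           # summand of (183)
    return eps, rate

def approx_rate(lams, eps):
    """Leading term of (186)."""
    return eps * eps / (8.0 * sum(l * l for l in lams))

lams = [1.0, 0.5, 0.25]  # illustrative eigenvalues (assumption)
for nu1 in (1e-1, 1e-2, 1e-3):
    eps, rate = exact_point(lams, nu1)
    print(nu1, eps, rate, approx_rate(lams, eps))
```

As $\nu_{1}$ shrinks, the ratio between the exact rate and the leading term should approach one.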

H-2 Low-Distortion Compression

Let $D=\epsilon$ for some small $\epsilon>0$ . Note that as $\epsilon\rightarrow 0$ , we must have $\gamma_{\ell}^{*}\rightarrow 0$ by (177), and consequently $\nu_{1}\rightarrow\infty$ by (176). In this regime, we can approximate the water-levels in (176) as follows

	$\displaystyle\gamma_{\ell}^{*}(D,0)$	$\displaystyle=$	$\displaystyle\frac{2\lambda_{\ell}}{1+\sqrt{1+16\lambda_{\ell}^{2}\nu_{1}^{2}}}$	(188)
		$\displaystyle=$	$\displaystyle\frac{1}{2\nu_{1}}-\frac{1}{8\nu_{1}^{2}\lambda_{\ell}}+O\left(\frac{1}{\nu_{1}^{3}}\right).$	(189)

Plugging (189) into the distortion constraint (177), we have

	$\displaystyle\epsilon$	$\displaystyle=$	$\displaystyle\sum_{\ell=1}^{L}\left(2\lambda_{\ell}-2\sqrt{\lambda_{\ell}\left(\lambda_{\ell}-\gamma_{\ell}^{*}(D,0)\right)}\right)$	(190)
		$\displaystyle=$	$\displaystyle\frac{L}{2\nu_{1}}-\frac{1}{16\nu_{1}^{2}}\sum_{\ell=1}^{L}\frac{1}{\lambda_{\ell}}+O\left(\frac{1}{\nu_{1}^{3}}\right),$	(191)

which implies

$\displaystyle\frac{1}{\nu_{1}}=\frac{2\epsilon}{L}+\frac{\epsilon^{2}}{2L^{3}}\sum_{\ell=1}^{L}\frac{1}{\lambda_{\ell}}+O(\epsilon^{3}).$	(192)

Substituting (192) into (189) shows that the water-levels in the low-distortion regime are given by

$\displaystyle\gamma_{\ell}^{*}(\epsilon,0)=\frac{\epsilon}{L}-\frac{\epsilon^{2}}{2L^{2}\lambda_{\ell}}+\frac{\epsilon^{2}}{4L^{3}}\sum_{\ell^{\prime}=1}^{L}\frac{1}{\lambda_{\ell^{\prime}}}+O(\epsilon^{3}),\quad\ell\in\{1,\ldots,L\}.$	(193)

The rate expression can now be approximated as follows

$\displaystyle R(\epsilon,0)$	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{\lambda_{\ell}}{\gamma^{*}_{\ell}(\epsilon,0)}$	(194)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{L\lambda_{\ell}}{\epsilon}-\frac{1}{2}\sum_{\ell=1}^{L}\log\left(1-\frac{\epsilon}{2L\lambda_{\ell}}+\frac{\epsilon}{4L^{2}}\sum_{\ell^{\prime}=1}^{L}\frac{1}{\lambda_{\ell^{\prime}}}+O(\epsilon^{2})\right)$	(195)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{L\lambda_{\ell}}{\epsilon}-\frac{1}{2}\sum_{\ell=1}^{L}\left(-\frac{\epsilon}{2L\lambda_{\ell}}+\frac{\epsilon}{4L^{2}}\sum_{\ell^{\prime}=1}^{L}\frac{1}{\lambda_{\ell^{\prime}}}\right)+O(\epsilon^{2})$	(196)
	$\displaystyle=$	$\displaystyle\frac{1}{2}\sum_{\ell=1}^{L}\log\frac{L\lambda_{\ell}}{\epsilon}+\frac{\epsilon}{8L}\sum_{\ell=1}^{L}\frac{1}{\lambda_{\ell}}+O(\epsilon^{2}).$	(197)

This concludes the proof.
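The low-distortion expansion (197) can be checked in the same spirit: pick a large $\nu_{1}$, compute the exact distortion $\epsilon$ from (177) and the exact rate from (175), and compare with the right-hand side of (197) without its remainder. The eigenvalues and function names are illustrative assumptions, and the logarithm is taken to be natural.

```python
import math

def exact_point(lams, nu1):
    """Exact (distortion, rate) pair from (176), (177) and (175) for a given nu1."""
    dist, rate = 0.0, 0.0
    for l in lams:
        gamma = 2.0 * l / (1.0 + math.sqrt(1.0 + 16.0 * nu1 * nu1 * l * l))
        dist += 2.0 * l - 2.0 * math.sqrt(l * (l - gamma))  # summand of (177)
        rate += 0.5 * math.log(l / gamma)                    # summand of (175)
    return dist, rate

def approx_rate(lams, eps):
    """Right-hand side of (197) without the O(eps^2) remainder."""
    L = len(lams)
    main = 0.5 * sum(math.log(L * l / eps) for l in lams)
    return main + eps / (8.0 * L) * sum(1.0 / l for l in lams)

lams = [1.0, 0.5, 0.25]  # illustrative eigenvalues (assumption)
for nu1 in (1e1, 1e2, 1e3):
    eps, rate = exact_point(lams, nu1)
    print(eps, rate, approx_rate(lams, eps))
```

The gap between the exact and approximate rates should shrink roughly quadratically in $\epsilon$ as $\nu_{1}$ increases.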

References

[1] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 221–231.
[2] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–27.
[3] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017.
[4] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, “Conditional probability models for deep image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4394–4402.
[5] Y. Blau and T. Michaeli, “Rethinking lossy compression: The rate-distortion-perception tradeoff,” in Proc. ACM Int. Conf. Mach. Learn. (ICML), 2019, pp. 675–685.
[6] N. Saldi, T. Linder, and S. Yüksel, “Output constrained lossy source coding with limited common randomness,” IEEE Trans. Inf. Theory, vol. 61, no. 9, pp. 4984–4998, Jun. 2015.
[7] L. Theis and A. Wagner, “A coding theorem for the rate-distortion-perception function,” in Neural Compression Workshop of Int. Conf. Learn. Represent. (ICLR), 2021, p. 9.
[8] C. T. Li and A. El Gamal, “Strong functional representation lemma and applications to coding theorems,” IEEE Trans. Inf. Theory, vol. 64, no. 11, pp. 6967–6978, Nov. 2018.
[9] G. Zhang, J. Qian, J. Chen, and A. Khisti, “Universal rate-distortion-perception representations for lossy compression,” in Proc. Adv. Neural Inf. Process. Sys. (NeurIPS), 2021, pp. 11 517–11 529.
[10] A. B. Wagner, “The rate-distortion-perception tradeoff: The role of common randomness,” arXiv:2202.04147, 2022.
[11] J. Chen, L. Yu, J. Wang, W. Shi, Y. Ge, and W. Tong, “On the rate-distortion-perception function,” IEEE J. Sel. Areas Inf. Theory, vol. 3, no. 4, pp. 664–673, Dec. 2022.
[12] D. Freirich, T. Michaeli, and R. Meir, “A theory of the distortion-perception tradeoff in Wasserstein space,” Proc. Adv. Neural Inf. Process. Sys. (NeurIPS), vol. 34, pp. 25 661–25 672, 2021.
[13] Z. Yan, F. Wen, R. Ying, C. Ma, and P. Liu, “On perceptual lossy compression: The cost of perceptual reconstruction and an optimal training framework,” in Proc. ACM Int. Conf. Mach. Learn. (ICML), 2021, pp. 11 682–11 692.
[14] H. Liu, G. Zhang, J. Chen, and A. Khisti, “Lossy compression with distribution shift as entropy constrained optimal transport,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022, pp. 1–6.
[15] S. Salehkalaibar, B. Phan, J. Chen, W. Yu, and A. Khisti, “On the choice of perception loss function for learned video compression,” in Proc. Adv. Neural Inf. Process. Sys. (NeurIPS), 2023.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Ed. Wiley, 2006.
[17] L. Song, J. Chen, and C. Tian, “Broadcasting correlated vector gaussians,” IEEE Trans. Inf. Theory, vol. 61, no. 5, pp. 2465–2477, May 2015.
