Rate-Distortion-Perception Tradeoff for Gaussian Vector Sources
Abstract
This paper studies the rate-distortion-perception (RDP) tradeoff for a Gaussian vector source coding problem, where the goal is to compress the multi-component source subject to distortion and perception constraints. The perception constraint is imposed to ensure visually pleasing reconstructions. We consider this RDP setting with either the Kullback-Leibler (KL) divergence or the Wasserstein-2 metric as the perception loss function, and show that for Gaussian vector sources, jointly Gaussian reconstructions are optimal. We further demonstrate that the optimal tradeoff can be expressed as an optimization problem, which can be explicitly solved. An interesting property of the optimal solution is as follows. Without the perception constraint, the traditional reverse water-filling solution for characterizing the rate-distortion (RD) tradeoff of a Gaussian vector source states that the optimal rate allocated to each component depends on a constant, called the water-level. If the variance of a specific component is below the water-level, it is assigned a zero compression rate. However, with active distortion and perception constraints, we show that the optimal rates allocated to the different components are always positive. Moreover, the water-levels that determine the optimal rate allocation for different components are unequal. We further treat the special case of perceptually perfect reconstruction and study its RDP function in the high-distortion and low-distortion regimes to obtain insight into the structure of the optimal solution.
Index Terms:
Rate-distortion-perception function, lossy source coding, lossy compression, Gaussian vector sources, reverse water-filling
I Introduction
The rate-distortion-perception (RDP) function is a generalization of Shannon’s rate-distortion function that incorporates an additional perception loss function which measures the distance between the distributions of the source and the reconstruction. It has been observed that in the neural compression framework [1, 2, 3, 4], improving realism in the reconstruction comes at the price of increased distortion. In this framework, realism is controlled by a perception loss function between the distributions of the source and the reconstruction, while distortion is controlled via a standard distortion loss function on the samples of the source and its reconstruction, e.g., in terms of mean squared error. The RDP function introduced in Blau and Michaeli [5] formalizes this tradeoff.
The extension of classical rate-distortion (RD) theory to incorporate constraints on the distribution of the reconstruction samples has been studied in various works in the information theory literature; see, e.g., [6] and references therein. More recently, Theis and Wagner [7] presented a one-shot coding theorem by means of the strong functional representation lemma (SFRL) [8] to establish the operational validity of the RDP function [5]. In [9], the authors establish analytic properties of the RDP function for the special case of (scalar) Gaussian sources, with a quadratic distortion function and a perception loss function of either Kullback–Leibler (KL) divergence or Wasserstein-2 distance between the source and the reconstruction distributions. The role of common randomness in the RDP setting has been investigated in [10, 11]. Furthermore, the distortion-perception tradeoff with a squared error distortion and Wasserstein-2 perception loss, but without an explicit compression rate constraint, has been studied in [12, 13], where it is shown that the entire tradeoff curve can be achieved by interpolating the two extremal reconstructions based on a given representation. Other related works include [14, 15].
This paper studies the RDP function of a Gaussian vector source under a squared error distortion and either KL divergence or Wasserstein-2 distance as the perception loss metric. Our result is thus an extension of prior work [9] on scalar Gaussian sources to the case of vector sources. We start by demonstrating the optimality of jointly Gaussian reconstructions for Gaussian vector sources in the RDP setting. We then show that by decomposing the Gaussian vector source using the unitary transformation obtained from the eigenvalue decomposition of its covariance matrix, it is possible to derive an achievable RDP function of the Gaussian vector source in terms of the RDP functions of its constituent scalar components. The optimality of this achievable scheme is established by a converse proof. Consequently, the optimal RDP function can be characterized as the solution of an optimization problem. We explicitly derive the solution of this optimization problem and investigate structural properties of the optimal solution.
The optimal RDP function for the Gaussian vector source has the following interesting property. Without the perception constraint, the rate-distortion function of a parallel Gaussian source model has a classical reverse water-filling characterization [16, Thm 10.3], where the optimal rate allocation across the components is computed according to a distortion-dependent parameter called the water-level. A positive rate is assigned to those components that have a variance above this parameter. Any component whose variance is below the water-level has a zero rate; see Fig. 1(a). However, with a perception constraint, we observe a qualitatively different solution, as shown in Fig. 1(b). First, unlike the case of reverse water-filling, the associated water-level for each component can be different and is characterized as a solution to a set of equations. Second, while reverse water-filling assigns zero rate to those source components whose variances are below the water-level, all components in the RDP setting are assigned a non-zero rate as long as both the distortion and perception constraints are active.
![Refer to caption](x1.png)
We further consider the special case of zero perception loss (so the source and reconstruction distributions are identical) and establish analytical results in this case. Moreover, we present asymptotic results on high and low distortion cases with zero perception, and shed additional insights into the difference between the RDP function and the RD function.
The rest of the paper is organized as follows. In Section II, we introduce the system model and some preliminaries. Some basics on the traditional reverse water-filling solution are provided in Section III. We discuss the generalized water-filling solution in Section IV for both KL-divergence and Wasserstein-2 distance as perception metrics; some properties of the RDP function are also discussed for perfect perceptual reconstruction; the asymptotic analysis is provided for both low and high distortion regimes.
Notation: We denote entropy, differential entropy and mutual information by , and , respectively. The cardinality of the set is written as . We use to denote the probability distribution function of a random vector . We use to denote the Gaussian distribution with mean and covariance matrix . We use to denote the expectation operator, and to denote the set of real numbers. Throughout this paper, the base of the logarithm function is .
II System Model and Preliminaries
Let be an -dimensional Gaussian vector source with mean and covariance matrix . Consider the eigenvalue decomposition of as follows:
(1) |
where is unitary and is a diagonal matrix of positive eigenvalues (see Footnote 1):
(2) |
Footnote 1: If some of the eigenvalues are zero, the corresponding columns of the unitary matrix can be removed, giving a diagonal matrix of lower dimension. The rest of the derivation proceeds in the same way.
We assume that there is unlimited common randomness shared between the encoder and the decoder. Consider the following one-shot encoding and decoding functions where the source samples are encoded one at a time:
(3) | |||||
(4) |
Here, denotes the set of messages. Let be the distribution of the reconstruction induced by the encoding and decoding mechanisms. In this paper, we measure distortion using a squared-error loss function where . From a perceptual perspective, for given probability distributions and , we use to denote the perception loss function capturing the difference between the two distributions. For the two perception metrics that we consider in the following discussion, we have if and only if .
The above framework is referred to as the one-shot setting, because it compresses one sample at a time. We can also define the setting of encoding independently and identically distributed (i.i.d.) samples and reconstructing , and consider the asymptotic setting with .
Definition 1 (Operational RDP Functions)
Let . For given distortion-perception constraints , a rate is said to be achievable if there exist encoding and decoding functions satisfying
(5) | |||||
(6) | |||||
(7) |
where denotes the length of the message for encoding one sample. The infimum of all achievable rates is called the one-shot rate-distortion-perception (RDP) function, denoted as .
For the asymptotic setting, given distortion-perception constraints , a rate is said to be achievable if there exist encoding and decoding functions such that
(8) | |||||
(9) |
with the message that encodes satisfying
(10) |
The infimum of all achievable rates is called the asymptotic RDP function, denoted as .
Definition 2 (Information RDP Function)
For given , let be the set of conditional distributions such that for a fixed , we have
(11) |
The information rate-distortion-perception (RDP) function is defined as
(12) |
As explained in detail later, using the SFRL as in [8] and following similar steps to Theorem 2 and Theorem 5 in Appendix A.2 of [9], one can show that
(13) |
and
(14) |
Consequently, the one-shot operational RDP function is asymptotically close to the information RDP function and the asymptotic RDP function at high rate.
In the rest of the paper, the perception metric is assumed to be either the KL-divergence, i.e.,
(15) |
or the (squared) Wasserstein-2 distance, i.e.,
(16) |
where the infimum is taken over all joint distributions of with marginals and .
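For zero-mean scalar Gaussian distributions, both perception metrics have well-known closed forms, which is what makes a per-component analysis tractable. The following minimal Python sketch evaluates them. The function names and the use of the natural logarithm are assumptions made for illustration, and the KL function takes its two variances in an explicit order so that either argument order of the divergence can be evaluated.

```python
import math


def kl_zero_mean_gauss(var_p: float, var_q: float) -> float:
    """KL divergence D(N(0, var_p) || N(0, var_q)), in nats."""
    if var_p <= 0.0 or var_q <= 0.0:
        raise ValueError("variances must be positive")
    ratio = var_p / var_q
    # Closed form for scalar zero-mean Gaussians.
    return 0.5 * (ratio - 1.0 - math.log(ratio))


def w2_squared_zero_mean_gauss(var_p: float, var_q: float) -> float:
    """Squared Wasserstein-2 distance between N(0, var_p) and N(0, var_q)."""
    if var_p < 0.0 or var_q < 0.0:
        raise ValueError("variances must be nonnegative")
    # For scalar Gaussians with equal means, only the standard deviations matter.
    return (math.sqrt(var_p) - math.sqrt(var_q)) ** 2
```

Both quantities vanish exactly when the two variances coincide, in line with the property that the perception loss is zero if and only if the two distributions are equal.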
Before characterizing the RDP function, we first review the case of no perception constraint, which corresponds to traditional reverse water-filling for the classical rate-distortion function.
III Traditional Reverse Water-Filling
Classical rate-distortion theory for a parallel Gaussian source states that the optimal rate allocated to each component depends on a constant parameter, called the water-level, as shown in Fig. 1(a). The water-level also represents the distortion allowed at those components whose variances are above the water-level. For a given distortion , let be the solution to the equation
(17) |
where . Now, let
(20) |
The rate-distortion function for the Gaussian vector source with variance for its -th component, , is as follows.
Theorem 1 (Thm 10.3 in [16])
For a Gaussian vector source, we have
(21) |
To simplify notation, we can redefine the water-level as in order to account for the components whose variances are below the water-level. If is below for some , then we set and assign zero rate to this component. Two special cases of the above theorem are of special interest.
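The reverse water-filling rule described above can be sketched numerically. The following Python snippet is an illustrative implementation of the textbook result [16, Thm 10.3], not code from the paper: the names `variances`, `distortion`, and `theta` are assumptions, rates are in nats (natural logarithm assumed), and the water-level is found by bisection on the total distortion.

```python
import math


def reverse_water_filling(variances, distortion, iterations=200):
    """Return (water_level, per-component distortions, total rate in nats)."""
    if not 0.0 < distortion <= sum(variances):
        raise ValueError("distortion must lie in (0, sum of variances]")
    lo, hi = 0.0, max(variances)
    # sum_i min(theta, variance_i) is nondecreasing in theta, so bisect on it.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(min(mid, v) for v in variances) < distortion:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    # Components below the water-level are reproduced with distortion equal
    # to their variance, which corresponds to zero rate.
    comp_dist = [min(theta, v) for v in variances]
    rate = sum(0.5 * math.log(v / d) for v, d in zip(variances, comp_dist))
    return theta, comp_dist, rate
```

For example, with variances (4, 2, 1) and a total distortion of 3, the water-level is 1, so the component with variance 1 receives zero rate while the other two receive positive rates.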
Proposition 1 (High-Distortion Compression)
In the high-distortion regime, we have that for sufficiently small
(22) |
where . Let denote the set of indices where their corresponding eigenvalues are equal to . Then, the water-levels are given by
(23a) | |||||
(23b) |
Proof:
See Appendix A-1. ∎
The above proposition states that in high-distortion compression, a positive rate is assigned only to the components with the largest eigenvalue.
Proposition 2 (Low-Distortion Compression)
In the low-distortion regime, we have that for a sufficiently small
(24) |
where the water-levels are given by
(25) |
Proof:
See Appendix A-2. ∎
For low-distortion compression, according to the above proposition, the same water-level is assigned to all components.
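In both asymptotic regimes the classical water-level can be written in closed form, which the following Python sketch illustrates. It is a hedged illustration under stated assumptions: `eps` denotes the gap between the total source variance and the distortion in the high-distortion regime, `distortion` is the total distortion in the low-distortion regime, rates are in nats, and the exact expressions below may differ in form from the first-order statements in (22) and (24).

```python
import math


def high_distortion_rd(variances, eps):
    """Distortion = sum(variances) - eps, with eps small enough that only
    the largest-variance components receive a positive rate."""
    lam_max = max(variances)
    m = sum(1 for v in variances if v == lam_max)  # multiplicity of the maximum
    others = [v for v in variances if v != lam_max]
    theta = lam_max - eps / m
    if eps <= 0.0 or theta <= 0.0 or (others and theta < max(others)):
        raise ValueError("eps is outside the high-distortion regime")
    return theta, 0.5 * m * math.log(lam_max / theta)


def low_distortion_rd(variances, distortion):
    """Distortion small enough that the common water-level lies below
    every component variance."""
    theta = distortion / len(variances)
    if distortion <= 0.0 or theta > min(variances):
        raise ValueError("distortion is outside the low-distortion regime")
    return theta, sum(0.5 * math.log(v / theta) for v in variances)
```

In the first function the rate is concentrated on the components attaining the largest variance; in the second, every component shares the same water-level, equal to the total distortion divided by the number of components.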
IV Rate-Distortion-Perception Function
IV-A Optimality of Gaussian Reconstruction
We first present a result indicating that for the two perception metrics (15) and (16) considered in this paper and for a Gaussian vector source, jointly Gaussian reconstruction is optimal.
Theorem 2
For a zero-mean Gaussian source , if the perception metric is either the KL-divergence or the Wasserstein-2 distance, without loss of optimality, in the optimization problem (12), we can restrict the reconstruction to have mean zero and be jointly Gaussian with .
Proof:
See Appendix B. ∎
A common property of the two perception metrics that enables the above theorem to hold is that if the source is Gaussian distributed, conditional Gaussian reconstruction minimizes both metrics among those with the same first- and second-order joint statistics. Theorem 2 implies that the optimization of the RDP function can be restricted to jointly Gaussian distributions that satisfy the distortion and perception constraints.
IV-B RDP Function with KL Divergence as Perception Metric
In this section, we present the RDP function with the KL-divergence as the perception metric, i.e., . The results for the Wasserstein-2 distance as the perception metric are stated in the subsequent section. We present both one-shot and asymptotic RDP functions. As already mentioned, the one-shot RDP function is close to the information RDP function at high rate. Here we provide explicit constructions of both one-shot and asymptotic coding strategies for achieving (close to) .
The first step is to decompose the source using eigenvalue decomposition as in (1) and define
(26) |
The main idea is to construct a new Gaussian random vector and to use the channel simulation result of [8] to communicate to the decoder at a rate of . The new random vector is designed to be correlated with in a very specific way in order to satisfy the distortion and perception constraints and , respectively. The correlation between and is controlled by two sets of parameters, and , such that and . The optimal values of these parameters will be determined later.
In effect, instead of the classical rate-distortion setting where is chosen to minimize the rate subject to the distortion constraint, here we choose to satisfy both distortion and perception constraints. We construct this noisy version of at the decoder by taking advantage of the availability of common randomness.
Specifically, is a zero-mean random vector with a joint Gaussian distribution with such that for different , are mutually independent and
(27) |
With the above covariance structure, we can verify that is the minimum mean-squared error (MMSE) of estimating based on , i.e.,
(28) |
Now, to derive the one-shot RDP function , we can make use of a consequence of the SFRL [8, Theorem 1] to show that when common randomness is available at both the encoder and decoder, there exists a channel simulation scheme that allows to be reconstructed at the decoder at a communication rate of
(29) |
After the reconstruction of at the decoder, we use the same unitary matrix to transform it into , i.e.,
(30) |
The above scheme leads to the one-shot rate, distortion, and perception loss for the -th component of as functions of , and as follows:
(31) | |||||
(32) | |||||
(33) |
This allows a characterization of an achievable one-shot RDP function of a Gaussian vector source as an optimization problem over and across its components.
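The per-component quantities can be illustrated with a short Python sketch derived only from the properties stated in the text: each source component and its reconstruction are zero-mean and jointly Gaussian, and one parameter equals the MMSE of estimating the source component from its reconstruction. The symbol names `lam` (source variance), `lam_hat` (reconstruction variance), and `gamma` (MMSE) are assumptions, since the extracted text does not show the paper's symbols. Only the mutual-information term of the rate is computed (the one-shot overhead in (31) is omitted), the cross-covariance is taken to be nonnegative, the KL divergence is assumed to be taken from the reconstruction distribution to the source distribution, and logarithms are natural.

```python
import math


def component_tradeoff(lam: float, lam_hat: float, gamma: float):
    """Return (rate, distortion, kl_perception) for one scalar component.

    lam     : variance of the source component (assumed name)
    lam_hat : variance of the reconstructed component (assumed name)
    gamma   : MMSE of estimating the source from the reconstruction
    """
    if not (0.0 < gamma <= lam and lam_hat > 0.0):
        raise ValueError("need 0 < gamma <= lam and lam_hat > 0")
    # MMSE condition: lam - cov**2 / lam_hat == gamma (nonnegative cov assumed).
    cov = math.sqrt(lam_hat * (lam - gamma))
    # Mutual information of two jointly Gaussian scalars, in nats.
    rate = 0.5 * math.log(lam / gamma)
    # Mean squared error E[(X - X_hat)^2].
    distortion = lam + lam_hat - 2.0 * cov
    # Assumed direction: D(N(0, lam_hat) || N(0, lam)).
    ratio = lam_hat / lam
    kl_perception = 0.5 * (ratio - 1.0 - math.log(ratio))
    return rate, distortion, kl_perception
```

As a sanity check, choosing `lam_hat = lam - gamma` recovers the classical rate-distortion operating point, where the distortion equals `gamma`; choosing `lam_hat = lam` gives zero perception loss at the cost of a larger distortion for the same rate.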
For the asymptotic setting, the achievable scheme is identical, except that we compress a block of samples together. As , the logarithm and the constant terms in (31) can be neglected. This leads to an upper bound for , which is equal to . This upper bound turns out to be tight, i.e., a converse can be proved. This gives the following characterization of .
Theorem 3
Proof:
See Appendix C. ∎
An interpretation of the above is as follows. For a given , let and , , be the optimal solution to (34). Comparing this with (21), it can be seen that can be interpreted as the water-level for the -th component, which determines the rate allocated to that component according to (34a); see Fig. 1(b).
IV-C Generalized Water-filling with KL Divergence as Perception Metric
We now proceed to analyze the solution to the optimization program in Theorem 3. It can be shown that the optimization problem (34) is convex. Let , , , be nonnegative Lagrange multipliers. For , we have the first-order conditions:
(35) |
and
(36) |
We first focus on the most interesting regime where the distortion and the perception constraints are both active so , and , so that for all . In this case, (35) implies that can be expressed as
(37) |
Together with (36), this means that is the positive solution to the following equation
(38) |
which is quadratic in and can be solved analytically as follows:
There is an alternative expression for in terms of that can be obtained by solving (37) as a quadratic equation in , as follows:
(40) |
This expression is useful later in Corollary 1.
The expressions (39) and (37) give us the following generalized reverse water-filling interpretation of the optimal RDP solution. For a given distortion constraint and perception constraint , each component of the source with variance is reconstructed with variance . Because is the MMSE of estimating given , this requires a rate of . The parameters and are chosen to satisfy the distortion and perception constraints. As already mentioned, can be thought of as the water-level, cf. (21).
When both the distortion and the perception constraints are active, i.e., , it is possible to prove (as shown in the theorem below) that
(41) |
so every component of the source is always allocated a non-zero rate regardless of the distortion constraint—unlike the traditional reverse water-filling solution, where a component may be allocated zero rate if its variance is below the water-level. Moreover, in contrast to the traditional reverse water-filling, the distortion of each component (i.e., ) may not be the same across the different components. So, an unequal-distortion allocation may be optimal when both perception and distortion constraints are active.
It is also possible that either the distortion or the perception constraint is not active. If the distortion constraint is active while the perception constraint is inactive, i.e., and , and for all , then (35) and (36) yield the traditional reverse water-filling solution. Specifically, the water-level is given by where satisfies the following:
(42) |
By redefining as , we see that the above expression is the same as (17).
If the distortion constraint is inactive, i.e., , based on (35), we have which yields
(43) |
This implies that every component of the source is assigned a zero rate if the distortion constraint is not active. The decoder simply generates the reconstruction independent of the source using a distribution that satisfies the perception constraint. Such a distribution may not be unique, as shown in the theorem below.
An interesting observation, based on (41) and (43), is that when the perception constraint is active, either all components are allocated a positive rate or all components are allocated zero rate. This means that the situation in traditional reverse water-filling, where some of the water-levels are below the eigenvalues while others are equal to the eigenvalues, cannot occur when the perception constraint is active.
The above discussion is summarized in the following.
Theorem 4
Let be a given distortion and perception constraints that are strictly feasible. The optimal solution of (34) with KL divergence as the perception metric is given as follows:
1. If both the distortion and perception constraints are active (see Footnote 2), then there exist such that is as expressed in (39) and is as expressed in (37). Here, and are chosen such that
(44) (45)
In this case, every component has a positive rate.
2. If the distortion constraint is active but the perception constraint is inactive, then there exists such that , and
(46)
In this case, some components may have zero rate.
3. If the distortion constraint is inactive, then , and can be any value in the set
(47)
In this case, every component has zero rate.
Footnote 2: A constraint of a minimization problem is said to be inactive if the optimization problem with the same objective function but with the said constraint removed (while keeping all the other constraints) has at least one optimal solution that already satisfies all the original constraints.
Proof:
See Appendix D. ∎
IV-D RDP Function and Generalized Reverse Water-filling with Wasserstein-2 Distance as Perception Metric
Next, consider the Wasserstein-2 distance as the perception metric, i.e., . To that end, we have the following definitions for distortion and perception loss functions. Let the distortion loss function of the -th component be as in (32). Replace the perception loss function in (33) by the following:
(48) |
The following theorem characterizes the RDP function with Wasserstein-2 perception loss in terms of an optimization problem.
Theorem 5
Proof:
Similar to the KL-divergence case, the optimization program for the Wasserstein-2 distance is convex. For , we have the following first-order conditions:
(49) |
and
(50) |
Consider the case where both distortion and perception constraints are active, i.e., and for all . In this case, (49) and (50) yield the following solutions
(51) | |||||
(52) |
where is defined to be the unique solution of the following equation:
(53) |
As in the case of KL divergence, it is possible to prove that when both the distortion and the perception constraints are active we have . Thus, every component is compressed at a positive rate.
When the distortion constraint is active but the perception constraint is not active, the problem reduces to traditional reverse water-filling. Finally, when the distortion constraint is not active, i.e., , a zero rate is assigned to all components. This discussion is summarized in the following.
Theorem 6
Let be a given distortion and perception constraints that are strictly feasible. The optimal solution of (34) with the perception metric (33) replaced by (48) is given as follows:
1.
2. If the distortion constraint is active but the perception constraint is inactive, then there exists such that , and
(56)
In this case, some components may have zero rate.
3. If the distortion constraint is inactive, then , and can be any value in the set
(57)
In this case, every component has zero rate.
Proof:
See Appendix F. ∎
IV-E Perceptually Perfect Reconstruction
In this section, we focus on the special case of perfect perceptual quality, and study the properties of the RDP function with .
![Refer to caption](x2.png)
Corollary 1
The RDP function of a Gaussian vector source with is
(58) |
for some positive that satisfies
(59) |
where
(60) |
Proof:
See Appendix G. ∎
An interpretation of the optimal rate allocation in this case is as follows. By (58), the optimal rate allocated to the -th component is controlled by the expression . So, if a component has a larger variance, it is compressed at a higher rate. Further, by (60) it also has a higher water-level.
Under general perception and distortion constraints, the encoding and decoding strategy adopted in this paper (which involves constructing as in (27)) can be thought of as first compressing each component of the source at an individual rate specified by the distortion level based on the conventional rate-distortion tradeoff, then scaling the compressed source to a variance of to satisfy the perception constraint. For the perfect perception case with , the compression rate becomes (58) and the distortion level becomes (60); further, each component of the compressed signal is simply scaled to match the variance of the source in order to ensure zero perception loss. The distortion after scaling is given by (59). This is shown in Fig. 2.
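The compress-then-scale interpretation can be checked on a single scalar component. The Python sketch below is an illustration under stated assumptions rather than the paper's construction: `lam` is the component variance, `gamma` is the distortion level of the conventional Gaussian rate-distortion test channel (source = compressed signal + independent noise of variance `gamma`), and the compressed signal is then rescaled so that its variance matches `lam`. Both names are assumptions, and rates are in nats.

```python
import math


def compress_then_scale(lam: float, gamma: float):
    """Return (rate, distortion_before_scaling, distortion_after_scaling)."""
    if not 0.0 < gamma < lam:
        raise ValueError("need 0 < gamma < lam")
    rate = 0.5 * math.log(lam / gamma)
    # Conventional test channel: the compressed signal has variance
    # lam - gamma, and its covariance with the source is also lam - gamma.
    var_compressed = lam - gamma
    # Rescale so that the reconstruction variance equals the source variance.
    scale = math.sqrt(lam / var_compressed)
    cov_after = scale * var_compressed  # equals sqrt(lam * (lam - gamma))
    distortion_after = 2.0 * lam - 2.0 * cov_after
    return rate, gamma, distortion_after
```

The scaling leaves the rate unchanged, since it is an invertible map applied at the decoder, but it increases the distortion above `gamma`; this gap is the price of matching the source variance.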
We further note that at a fixed , the rate allocated to each component is in general different for different tradeoff points. Whereas for the scalar Gaussian source, a universal representation for different points at a fixed is possible via scaling [9], for the Gaussian vector source such a universal representation does not exist, due to the different rate allocations in each component at different tradeoff points.
Next, we investigate the asymptotic behavior of the compression rate and the distortion level in the perfect perception case.
Proposition 3 (High-Distortion Compression)
In the high-distortion and perfect perception regime, we have that for sufficiently small ,
(61) |
where the water-levels are given by
(62) |
Proof:
See Appendix H-1.∎
Here, we express in terms of the deviation from the maximum distortion at perfect perception and zero rate. This maximum distortion can be shown to be , which is twice the total variance of the source [9], because at zero rate the decoder should simply generate an independent Gaussian random vector with the same covariance matrix. Comparing of Proposition 3 with in Proposition 1, it is interesting to see that the variances of the source enter as , which is the sum of the squares of the variances over all the components. This is in contrast to the corresponding factor in in the traditional reverse water-filling solution, which is simply . This is a consequence of the perfect perception constraint, which requires all the components to be reconstructed with the same variances as the source at the decoder.
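The zero-rate claim, that an independent reconstruction with the same covariance yields a distortion equal to twice the total source variance, can be checked with a small Monte Carlo experiment. The Python sketch below is illustrative only: it works in the eigenbasis, where the components are independent and the squared error is unchanged by the unitary transformation, and the function name, sample count, and seed are arbitrary assumptions.

```python
import random


def zero_rate_distortion_mc(variances, num_samples=50000, seed=0):
    """Estimate E||X - X_hat||^2 when X_hat is drawn independently of X
    with the same per-component variances (zero rate, zero perception loss)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        for v in variances:
            std = v ** 0.5
            x = rng.gauss(0.0, std)
            x_hat = rng.gauss(0.0, std)  # independent of x, same variance
            total += (x - x_hat) ** 2
    return total / num_samples
```

For variances (4, 2, 1), the estimate should be close to 14, i.e., twice the total variance of 7, up to Monte Carlo error.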
Proposition 4 (Low-Distortion Compression)
In the low-distortion and perfect perception regime, we have that for sufficiently small ,
(63) |
where the water-levels are given by
(64) |
Proof:
See Appendix H-2. ∎
Comparing Proposition 4 with Proposition 2, we see that in this high-rate low-distortion regime, the extra rate required to satisfy zero-perception scales as
(65) | |||||
(66) |
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
![Refer to caption](x6.png)
Fig. 3 shows the water-levels of different components for both low-distortion and high-distortion compression with or for an example of a Gaussian vector source. The water-levels determine the compression rates assigned to each component.
In Fig. 3(a), for high-distortion compression with no perception constraint, all components except the one with the largest eigenvalue are allocated a zero compression rate (cf. Proposition 1). With an active perception constraint, as shown in Fig. 3(c) for the case, all components are allocated positive rates (cf. Proposition 3).
In Fig. 3(b), for low-distortion compression with no perception constraint, the water-levels of all components are the same (cf. Proposition 2). At low distortion and with an active perception constraint, as shown in Fig. 3(d) for the case, the water-levels of different components are approximately equal with some slight differences which are determined by (64) in Proposition 4. Therefore, in the low-distortion regime, the water-levels of all components are approximately the same regardless of the perception constraint.
V Conclusions
This paper characterizes the RDP function for a Gaussian vector source. In contrast to the traditional reverse water-filling solution (without a perception constraint), the water-levels assigned to different components are not necessarily equal. When both distortion and perception constraints are active, every component is assigned a positive rate. These results have implications for perception-aware image coding.
Appendix A Asymptotic Analysis of the Traditional RD Function
A-1 High-Distortion Compression
Here, we consider for sufficiently small . Without loss of generality, we assume that the eigenvalues are ordered as follows
(67) |
First consider the case that . The distortion constraint (17) implies that
(68) |
The above condition implies that for a small enough , should satisfy
(69) |
Considering (69) with (68) yields
(70) |
Plugging the above into the RDP function of Proposition 1, we get
(71) | |||||
(72) |
Finally, noting gives (22).
If , then similar to the above discussion, all eigenvalues except the largest ones are assigned a zero compression rate and for the maximum eigenvalues, we have the following water-level
(73) |
and the following rate
(74) | |||||
(75) |
This proves (22) for arbitrary .
A-2 Low-Distortion Compression
Consider the case of for sufficiently small . In this low-distortion regime, the constant water-level is not saturated by the eigenvalues. Thus, Proposition 1 simplifies to the following
(76) |
Also, the distortion constraint (17) implies that
(77) |
Combining (76) and (77), we get the rate expression (24) in Proposition 2.
Appendix B Proof of Theorem 2
First, we prove the optimality of Gaussian reconstruction for the case of the KL-divergence as the perception metric. Define the following distribution
(78) |
Now, let be a random variable jointly Gaussian distributed with such that
(79a) | |||||
(79b) |
We proceed with lower bounding the rate as follows
(80) | |||||
(81) | |||||
(82) |
where (81) follows from (79) and the fact that under a fixed covariance matrix, a jointly Gaussian distribution maximizes the conditional differential entropy [17, Lemma 2]. The condition (79) also implies that for the distortion loss, we have
(83) |
Moreover, for the perception loss, we have
(84) | |||||
(85) | |||||
(86) | |||||
(87) | |||||
(88) | |||||
(89) | |||||
(90) |
where (87) follows because the expression for a vector only contains the terms such as , and for , and since according to (79), has the same mean and covariance matrix as , the expected values of these terms with respect to are equal to the same expectations calculated with respect to ; (89) follows because for a fixed covariance matrix, the differential entropy is maximized by a Gaussian distribution [16, Thm 8.6.5]. Finally, there is no loss of optimality in setting since replacing with does not increase , , and .
Thus, replacing by does not increase the rate, while the distortion and perception constraints remain satisfied. Hence, the optimal must be jointly Gaussian with .
For the case of the Wasserstein-2 distance as the perception metric, lower bounding steps for and are the same as (82) and (83), respectively. For the perception metric, the steps are refined as follows. Define the following distribution
(91) |
Now, define to be a joint Gaussian distribution such that
(92a) | |||||
(92b) | |||||
(92c) |
Then, we have the following set of inequalities:
(93) | |||||
(94) | |||||
(95) | |||||
(96) | |||||
(97) | |||||
(98) | |||||
(99) |
where
- •
- •
• (98) follows because and , which are justified as follows. First, notice that both and are Gaussian distributions. According to (92), the first- and second-order statistics of are equal to those of . Also, from (91), we know that , hence the first- and second-order statistics of and are the same. On the other hand, from (79), we know that the first- and second-order statistics of are equal to those of . Thus, we conclude that . A similar argument shows that .
Thus, without loss of optimality one can replace by , since the rate does not increase while the distortion and perception constraints remain satisfied.
Appendix C Proof of Theorem 3
We aim to establish the RDP function for the case of KL-divergence as the perception metric by showing that
(100) |
where
(101a) | |||
(101b) | |||
(101c) | |||
(101d) | |||
(101e) |
C-1 Proof of
Let be the optimal solution of (101). For , let be jointly Gaussian with with their covariance matrix as given in (27), and be independent of all other , i.e., . Let . Further, set . It can be verified that
(102) | |||||
(103) | |||||
(104) | |||||
(105) |
and
(106) | |||||
(107) | |||||
(108) | |||||
(109) |
where (102) and (106) are due to the invariance of KL-divergence and Euclidean distance under unitary transformations. Therefore, we must have . On the other hand,
(110) | |||||
(111) | |||||
(112) | |||||
(113) |
This proves .
C-2 Proof of
It follows from Theorem 2 that
(114a) | |||||
s.t. | (114c) | ||||
where has mean zero and is jointly Gaussian with . Let be the optimal distribution of the program in (114) and define . Let be the covariance matrix of and be a diagonal matrix whose diagonal elements coincide with those of , i.e.,
(115) |
Furthermore, define
(116) |
It can be verified that
(117) | |||||
(118) | |||||
(119) | |||||
(120) | |||||
(121) | |||||
(122) | |||||
(123) | |||||
(124) |
where
Next, consider the expected distortion loss as follows:
(125) | |||||
(126) | |||||
(127) | |||||
(128) | |||||
(129) |
where
Appendix D Proof of Theorem 4
First, we show that the optimization problem in (101) is convex. The second derivative of the objective function (101a) with respect to is which is positive. The second derivative of the function in the constraint (101e) with respect to is which is again positive. It just remains to study the constraint (101d). The Hessian matrix of the function in this constraint is
(134) |
The determinant of the above matrix is zero, and the matrix has positive diagonal terms. Thus, it is a positive semidefinite matrix, which implies the convexity of the associated function. This proves the convexity of the program in (101).
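The argument that a symmetric 2x2 matrix with zero determinant and positive diagonal entries is positive semidefinite can be checked on a concrete function. The sketch below uses f(a, b) = -2*sqrt(a*b) as a stand-in, which is an assumption made for illustration only, since the function in constraint (101d) is not reproduced here. This stand-in is homogeneous of degree one, so its Hessian has zero determinant, and its eigenvalues are then 0 and the (positive) trace.

```python
import math

def hessian_entries(a, b):
    """Analytic second derivatives of f(a, b) = -2*sqrt(a*b), for a, b > 0."""
    f_aa = 0.5 * math.sqrt(b) / a ** 1.5
    f_bb = 0.5 * math.sqrt(a) / b ** 1.5
    f_ab = -0.5 / math.sqrt(a * b)
    return f_aa, f_ab, f_bb

def eigenvalues_sym2(f_aa, f_ab, f_bb):
    """Eigenvalues (smallest first) of [[f_aa, f_ab], [f_ab, f_bb]]."""
    tr = f_aa + f_bb
    dt = f_aa * f_bb - f_ab ** 2
    # Clamp the discriminant at zero to guard against rounding error.
    disc = math.sqrt(max(tr * tr - 4.0 * dt, 0.0))
    return 0.5 * (tr - disc), 0.5 * (tr + disc)

for a, b in [(0.5, 2.0), (1.0, 1.0), (3.0, 0.2)]:
    f_aa, f_ab, f_bb = hessian_entries(a, b)
    determinant = f_aa * f_bb - f_ab ** 2
    print((a, b), determinant, eigenvalues_sym2(f_aa, f_ab, f_bb))
```

At each test point the determinant vanishes up to rounding and the smaller eigenvalue is numerically zero, which matches the positive semidefiniteness claim for this example.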
Since the is assumed to be strictly feasible, Slater’s condition is satisfied. This implies that the optimal value of this problem is equal to that of the following dual optimization problem
where and are nonnegative Lagrange multipliers. Note that the distortion function has implicit constraints and . Moreover, the derivatives of the respective terms go to infinity when and approach these boundaries. For this reason, we cannot immediately write down the Karush-Kuhn-Tucker (KKT) conditions for the optimization problem, and instead, need to carefully consider the behaviour of the optimization problem close to these boundaries. Toward this end, we consider the following three different cases.
D-1 Case Where the Maximum for the Outer Optimization Occurs at
This is the case where both perception and distortion constraints are active. Let and be the optimal solution to the inner minimization problem in (135) for the optimal and . We first note that
(136) |
This is because if , then we have which would violate the perception constraint.
Next, we show that the following strict inequality holds:
(137) |
Suppose that the above strict inequality does not hold, i.e., . We show that such cannot be the optimal solution to the inner minimization problem.
The Lagrangian term in (135) depends on and through the following function:
(138) | |||||
Fix . When we deviate from to for some small , the first order change in can be seen as follows:
(139) | |||||
(140) | |||||
(141) |
where we use the fact that for small . Thus if , since , for sufficiently small , we can strictly decrease , while satisfying the implicit constraints. This contradicts the assumption that is the optimal solution to the inner minimization problem. This proves (137), which implies that every component has positive rate.
The strict inequalities in (137) and (136) imply that in this case, the optimal solution occurs at the interior of the set . This allows us to write down the KKT conditions for the optimal primal variables and the optimal dual variables and as follows:
(142a) | |||||
(142b) | |||||
(142c) | |||||
(142d) | |||||
(142e) | |||||
(142f) |
along with primal and dual feasibility constraints, i.e., , and (34e)-(34e).
Due to the strict inequalities (137) and (136), we have that and . Then, from condition (142a), we can write as follows
(143) |
Plugging (143) into (142b) yields the following second-order equation in
(144) |
Note that as varies from to , the left-hand side of (144) decreases monotonically from to while the right-hand side of (144) increases monotonically from to . So, this equation has a unique solution in the interval . Since (144) is quadratic, it can be solved analytically. The solution gives (IV-C) and (37).
D-2 Case Where the Maximum for the Outer Optimization Occurs at
This is the case where the distortion constraint is active but the perception constraint is inactive. Clearly, this reduces to the traditional rate-distortion function.
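For reference, the traditional rate-distortion solution mentioned here is classical reverse water-filling [16]. The sketch below is a minimal implementation under the usual assumptions (independent zero-mean Gaussian components, mean squared error distortion, rates in nats). The function name and the bisection search are choices made for this illustration and are not taken from the paper.

```python
import math

def reverse_water_filling(variances, distortion):
    """Return (water_level, per-component rates in nats)."""
    if distortion <= 0.0:
        raise ValueError("distortion must be positive")
    if distortion >= sum(variances):
        # Zero rate suffices: every component is reconstructed as zero.
        return max(variances), [0.0] * len(variances)
    lo, hi = 0.0, max(variances)
    for _ in range(200):  # bisection on the water level
        mid = 0.5 * (lo + hi)
        # Total distortion achieved when the water level equals mid.
        if sum(min(mid, v) for v in variances) < distortion:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    # Components with variance below the water level receive zero rate.
    rates = [max(0.0, 0.5 * math.log(v / level)) for v in variances]
    return level, rates

level, rates = reverse_water_filling([4.0, 1.0], 3.0)
print(level, rates)  # water level near 2; the second component gets zero rate
```

In this example the component with variance 1 lies below the water level and is assigned zero rate, which is the behaviour that no longer occurs when both constraints are active (Appendix D-1).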
D-3 Case Where the Maximum for the Outer Optimization Occurs at
This is the case where the distortion constraint is inactive, so the inner minimization problem in (135) decouples into two independent minimizations, one for and the other one for , i.e.,
(145) |
For the first optimization problem in (145), its KKT conditions are given by
(146) | |||||
(147) |
The above two conditions imply that
(148) |
So each component has zero rate.
For the second minimization problem in (145), this is the Lagrangian dual of a feasibility problem with the perception constraint only. Thus, we can choose to satisfy the primal constraints:
(149) |
Note that although the distortion constraint is assumed to be inactive, we still need to impose an additional distortion constraint on :
(150) |
This is because not all ’s satisfying (149) satisfy the constraint (150). A constraint being inactive simply means that if the constraint is removed, there is already at least one optimal solution that automatically satisfies the constraint. In this case, there are multiple optimal solutions, all giving the same objective value (of zero rate). So we need to restrict to the ones that satisfy (150). Note that the left-hand side of (150) is the distortion of the reconstruction at zero rate.
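The zero-rate distortion can be illustrated with a small simulation. The sketch below assumes mean squared error distortion and a zero-mean reconstruction drawn independently of the zero-mean source, in which case the expected squared error is the sum of the source and reconstruction variances. The left-hand side of (150) is not reproduced here, so this is an illustration of the zero-rate setting under these assumptions, with hypothetical function names and example variances.

```python
import random

def zero_rate_distortion_exact(source_vars, recon_vars):
    """E||X - Xhat||^2 for independent zero-mean X and Xhat."""
    return sum(source_vars) + sum(recon_vars)

def zero_rate_distortion_mc(source_vars, recon_vars, num_samples=100000, seed=0):
    """Monte Carlo estimate of the same quantity with Gaussian components."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        for lam, lam_hat in zip(source_vars, recon_vars):
            x = rng.gauss(0.0, lam ** 0.5)          # source component
            x_hat = rng.gauss(0.0, lam_hat ** 0.5)  # independent reconstruction
            total += (x - x_hat) ** 2
    return total / num_samples

source_vars = [2.0, 1.0]
recon_vars = [2.0, 1.0]  # same variances as the source (example choice)
print(zero_rate_distortion_exact(source_vars, recon_vars))
print(zero_rate_distortion_mc(source_vars, recon_vars))
```

With matching variances the zero-rate distortion is twice the total source variance, so a distortion budget below that value rules out this particular choice.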
Appendix E Proof of Theorem 5
We now establish the RDP function with the Wasserstein-2 distance as the perception metric. The proof follows similar steps to those of the KL-divergence metric in Appendix C. We just need to rewrite the lower bounding steps for the perception metric. Let be the optimal conditional distribution of the following optimization program
(151a) | |||||
s.t. | (151c) | ||||
where has mean zero and is jointly Gaussian with . Let and be the covariance matrix of and be a diagonal matrix whose diagonal elements coincide with those of , i.e.,
(152) |
The lower bounding steps for the perception metric are as follows
(153) | |||||
(154) | |||||
(155) | |||||
(156) | |||||
(157) | |||||
(158) | |||||
(159) | |||||
(160) | |||||
(161) | |||||
(162) |
where
- (154) follows because the trace is invariant under unitary transformations;
- (156) follows because ;
- (159) follows from the definitions and ;
- (160) follows from the tensorization property of Wasserstein-2 distance, i.e., for given distributions and , we have .
On the other hand, the inequality in (160) becomes an equality if with constructed in such a way that , , are mutually independent and their covariance matrices are given by (27). Thus, the RDP function for the Wasserstein-2 distance as the perception metric is given by the following optimization problem:
(163a) | |||
(163b) | |||
(163c) | |||
(163d) | |||
(163e) |
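The tensorization step (160) can be illustrated for two-dimensional zero-mean Gaussians. The sketch below uses the known closed form of the squared Wasserstein-2 distance between Gaussians, specialized to 2x2 covariances through the identity that the trace of the square root of a 2x2 positive semidefinite matrix M equals sqrt(tr M + 2 sqrt(det M)). The function names and numerical values are hypothetical. For diagonal covariances (independent components), the joint distance should equal the sum of the per-component distances.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def w2_squared_gaussian_2x2(cov_a, cov_b):
    """Squared W2 distance between N(0, cov_a) and N(0, cov_b), 2x2 case."""
    trace_a = cov_a[0][0] + cov_a[1][1]
    trace_b = cov_b[0][0] + cov_b[1][1]
    # tr(cov_a cov_b) for symmetric matrices is the sum of entrywise products.
    trace_ab = sum(cov_a[i][j] * cov_b[i][j] for i in range(2) for j in range(2))
    # Sum of square roots of the eigenvalues of cov_a cov_b.
    cross = math.sqrt(trace_ab + 2.0 * math.sqrt(det2(cov_a) * det2(cov_b)))
    return trace_a + trace_b - 2.0 * cross

def w2_squared_scalar(var_a, var_b):
    """Squared W2 distance between N(0, var_a) and N(0, var_b)."""
    return (math.sqrt(var_a) - math.sqrt(var_b)) ** 2

lam = (2.0, 1.0)      # example source variances
lam_hat = (1.5, 0.8)  # example reconstruction variances

joint = w2_squared_gaussian_2x2([[lam[0], 0.0], [0.0, lam[1]]],
                                [[lam_hat[0], 0.0], [0.0, lam_hat[1]]])
per_component = sum(w2_squared_scalar(a, b) for a, b in zip(lam, lam_hat))
print(joint, per_component)  # equal up to floating-point error
```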
Appendix F Proof of Theorem 6
First, note that the optimization problem is convex for the Wasserstein-2 distance as justified below. The argument for the rate and distortion constraints is the same as the KL-divergence metric. The second derivative of the perception constraint in (163e) with respect to is , which is positive.
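The positivity of the second derivative can be checked numerically. The sketch below assumes that the per-component perception term is the squared Wasserstein-2 distance between two scalar zero-mean Gaussians, namely (sqrt(lam) - sqrt(lam_hat))**2. The exact form of (163e) is not reproduced here, so this is an assumption, and the names `lam` and `lam_hat` are hypothetical. Under this assumption the second derivative with respect to `lam_hat` is 0.5 * sqrt(lam) / lam_hat**1.5, which is compared against a central finite difference.

```python
import math

def w2_term(lam, lam_hat):
    """Assumed per-component perception term."""
    return (math.sqrt(lam) - math.sqrt(lam_hat)) ** 2

def second_derivative_analytic(lam, lam_hat):
    """Second derivative of w2_term with respect to lam_hat."""
    return 0.5 * math.sqrt(lam) / lam_hat ** 1.5

def second_derivative_numeric(lam, lam_hat, h=1e-4):
    """Central finite-difference approximation of the same derivative."""
    return (w2_term(lam, lam_hat + h) - 2.0 * w2_term(lam, lam_hat)
            + w2_term(lam, lam_hat - h)) / (h * h)

for lam, lam_hat in [(2.0, 0.5), (1.0, 1.0), (0.3, 4.0)]:
    print((lam, lam_hat),
          second_derivative_analytic(lam, lam_hat),
          second_derivative_numeric(lam, lam_hat))
```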
The optimization problem can be analyzed in the same way as in Appendix D, except the case of , which is discussed as follows. Here, we need a different proof to show the inequality
(164) |
(The proof uses the same technique as the one showing in Appendix D-1.) Consider the following Lagrange dual optimization
(165) | |||||
Suppose that the strict inequality in (164) does not hold, i.e., . We show that such cannot be the optimal solution to the inner minimization problem.
The Lagrangian term in (165) depends on and through the following function:
(166) | |||||
We fix and then deviate from to for some small . The first order change in can be seen as follows:
(167) | |||||
(168) |
Thus, if , for sufficiently small , we can strictly decrease , while satisfying the implicit constraints. This contradicts the assumption that is the optimal solution to the inner minimization problem. This proves (164). Given the strict inequality in (164), similar to the KL-divergence metric, we can show that
(169) |
The strict inequalities in (169) and (164) imply that each component has a positive rate, and further . Thus, we can write down the following KKT conditions
(170a) | |||||
(170b) | |||||
(170c) | |||||
(170d) |
The derivation of the optimal solution can now be shown as follows. Define
(171) |
Plugging the above definition into (170b) yields
(172) |
Also, from (170a), we get
(173) |
Plugging (172) and (173) into (171), we get the following equation:
(174) |
Note that the function is an increasing function in . Also, the function as defined in is a decreasing function in . So, the solution to the above equation is unique.
Appendix G Proof of Corollary 1
If , this falls under the first case in Theorem 4 and Theorem 6. Here, we have
(175) |
The perception constraints (45) and (55) with imply that for every . Now, using the expression for the optimal in (40) together with , we have
(176) |
where is chosen to satisfy the distortion constraint (44) and (54), i.e.,
(177) |
Combining the above proves the desired result.
Appendix H Asymptotic Analysis for Perceptually Perfect Reconstruction
We utilize the optimal solution for the perceptually perfect reconstruction case in Corollary 1, i.e., (175), (176) and (177).
H-1 High-Distortion Compression
Let for some small . Note that by (177), this means that we are setting to be
(178) |
In this case, should be close to , and the rate is close to zero. By (176), this also means that must be close to zero. Then, we can approximate as follows:
(179) | |||||
(180) | |||||
(181) |
Plugging the above into (178) yields
(182) |
The rate expression can now be approximated as follows
(183) | |||||
(184) | |||||
(185) |
Now, using (182) and (185) to eliminate , we get
(186) |
To derive the expression for the water-level, we use (182) in (181) to get
(187) |
H-2 Low-Distortion Compression
Let for some small . Note that as , we must have by (177), and consequently by (176). In this regime, we can approximate the water-levels in (176) as follows
(188) | |||||
(189) |
Plugging (189) into the distortion constraint (177), we have
(190) | |||||
(191) |
which implies
(192) |
Substituting (192) into (189) shows that the water-levels in the low-distortion regime are given by
(193) |
The rate expression can now be approximated as follows
(194) | |||||
(195) | |||||
(196) | |||||
(197) |
This concludes the proof.
References
- [1] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, “Generative adversarial networks for extreme learned image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 221–231.
- [2] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–27.
- [3] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017.
- [4] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. V. Gool, “Conditional probability models for deep image compression,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4394–4402.
- [5] Y. Blau and T. Michaeli, “Rethinking lossy compression: The rate-distortion-perception tradeoff,” in Proc. ACM Int. Conf. Mach. Learn. (ICML), 2019, pp. 675–685.
- [6] N. Saldi, T. Linder, and S. Yüksel, “Output constrained lossy source coding with limited common randomness,” IEEE Trans. Inf. Theory, vol. 61, no. 9, pp. 4984–4998, Jun. 2015.
- [7] L. Theis and A. Wagner, “A coding theorem for the rate-distortion-perception function,” in Neural Compression Workshop of Int. Conf. Learn. Represent. (ICLR), 2021, p. 9.
- [8] C. T. Li and A. El Gamal, “Strong functional representation lemma and applications to coding theorems,” IEEE Trans. Inf. Theory, vol. 64, no. 11, pp. 6967–6978, Nov. 2018.
- [9] G. Zhang, J. Qian, J. Chen, and A. Khisti, “Universal rate-distortion-perception representations for lossy compression,” in Proc. Adv. Neural Inf. Process. Sys. (NeurIPS), 2021, pp. 11 517–11 529.
- [10] A. B. Wagner, “The rate-distortion-perception tradeoff: The role of common randomness,” arXiv:2202.04147, 2022.
- [11] J. Chen, L. Yu, J. Wang, W. Shi, Y. Ge, and W. Tong, “On the rate-distortion-perception function,” IEEE J. Sel. Areas Inf. Theory, vol. 3, no. 4, pp. 664–673, Dec. 2022.
- [12] D. Freirich, T. Michaeli, and R. Meir, “A theory of the distortion-perception tradeoff in Wasserstein space,” in Proc. Adv. Neural Inf. Process. Sys. (NeurIPS), vol. 34, pp. 25 661–25 672, 2021.
- [13] Z. Yan, F. Wen, R. Ying, C. Ma, and P. Liu, “On perceptual lossy compression: The cost of perceptual reconstruction and an optimal training framework,” in Proc. ACM Int. Conf. Mach. Learn. (ICML), 2021, pp. 11 682–11 692.
- [14] H. Liu, G. Zhang, J. Chen, and A. Khisti, “Lossy compression with distribution shift as entropy constrained optimal transport,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022, pp. 1–6.
- [15] S. Salehkalaibar, B. Phan, J. Chen, W. Yu, and A. Khisti, “On the choice of perception loss function for learned video compression,” in Proc. Adv. Neural Inf. Process. Sys. (NeurIPS), 2023.
- [16] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd Ed. Wiley, 2006.
- [17] L. Song, J. Chen, and C. Tian, “Broadcasting correlated vector Gaussians,” IEEE Trans. Inf. Theory, vol. 61, no. 5, pp. 2465–2477, May 2015.