Affiliations: [1] Meta AI; [2] Mila - Québec AI Institute, Montreal, Quebec, Canada; [3] Department of Computer Science, McGill University, Montreal, Quebec, Canada; [4] Canada CIFAR AI Chair
Correspondence: Jonathan Lebensold, Chuan Guo
Code: https://github.com/facebookresearch/dp_rdm

DP-RDM: Adapting Diffusion Models to Private Domains Without Fine-Tuning

Jonathan Lebensold, Maziar Sanjabi, Pietro Astolfi, Adriana Romero-Soriano, Kamalika Chaudhuri, Mike Rabbat, Chuan Guo
(May 14, 2024)
Abstract

Text-to-image diffusion models have been shown to suffer from sample-level memorization, possibly reproducing near-perfect replicas of images that they are trained on, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that is capable of generating high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism to augment the text prompt with samples retrieved from a private retrieval dataset. Our differentially private retrieval-augmented diffusion model (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to another domain, and can use state-of-the-art generative models to generate high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, our DP-RDM can generate samples with a privacy budget of $\epsilon = 10$, while providing a 3.5 point improvement in FID compared to public-only retrieval for up to 10,000 queries.

Figure 1: Samples generated by our differentially private retrieval-augmented diffusion model (DP-RDM), which was trained on face-blurred ImageNet, using different private retrieval datasets at inference time: Shutterstock, MS-COCO with face-blurring (FB), ImageNet FB, and CIFAR-10. We calibrated the noise added in each row for a privacy budget of $\epsilon = 10$ after answering 1,000 queries. Each query uses $k = 18$ neighbors for retrieval augmentation and a 0.1%-0.3% random subset of the retrieval dataset. The differences between the generated images show how swapping the private retrieval dataset changes the distribution of the generated images to adapt to a given data domain, e.g., the differences in laptop and smart watch, or missing concepts such as goose, llama, smart watch and laptop in CIFAR-10.

1 Introduction

Text-to-image diffusion models (Ho et al., 2020; Song et al., 2020) enable highly customizable image generation, producing photo-realistic image samples that can be steered through text prompting. However, recent studies have also shown that it is possible to prompt text-to-image diffusion models to produce near-perfect replicas of some of their training samples (Somepalli et al., 2023; Carlini et al., 2023).

One promising strategy for mitigating this risk is through differential privacy (DP; Dwork et al. (2006)). In essence, if a generative model satisfies DP, then its generated samples cannot critically depend on any individual training sample, therefore preventing the model from replicating any training image in a provable manner. Currently, the state-of-the-art in DP image generation focuses on adaptation through fine-tuning (Ghalebikesabi et al., 2023; Lyu et al., 2023), where a diffusion model is first trained on public data (e.g., a licensed dataset) and then fine-tuned on a private dataset using DP-SGD (Abadi et al., 2016). This approach, although effective at a small scale, suffers immensely when scaling up the model and the private dataset due to the computational cost of DP fine-tuning (Sander et al., 2023; Mehta et al., 2022). To date, most studies on DP image generation are limited to simple image datasets such as MNIST and CIFAR-10 (Dockhorn et al., 2022).

To address this limitation, we adopt a different approach for private image generation through differentially private retrieval augmentation. Retrieval-augmented generation (RAG; Lewis et al. (2020)) is an emerging paradigm in generative modeling that performs generation in a semi-parametric manner. To generate a sample, the model utilizes both its learned parameters as well as a retrieval dataset that consists of an arbitrary set of images. In effect, the model can adapt to different domains at generation time by changing the retrieval dataset without fine-tuning. The modularity of RAG makes it ideal for privacy-sensitive applications, where the sensitive data can be stored in the retrieval dataset for more fine-grained control of information leakage.

We utilize this desirable property of RAG and define a differentially private retrieval-augmented diffusion model (DP-RDM) that is capable of generating high-quality images based on text prompts while satisfying rigorous DP guarantees. To this end, we first demonstrate that without any privacy protection, retrieval-augmented diffusion models are prone to copying artifacts from samples in the retrieval dataset. Then, we propose a private retrieval mechanism that adds calibrated noise to the retrieved samples, along with modifications to an existing retrieval-augmented diffusion model architecture (Blattmann et al., 2022) to accommodate this mechanism. As a result, our DP-RDM is capable of generating high-quality samples from complex image distributions at a moderate privacy cost of $\epsilon = 10$; see Figure 1 for examples. Notably, DP-RDM can potentially work with any state-of-the-art retrieval-augmented image generation model, regardless of the number of parameters and output resolution. To the best of our knowledge, this is the first work to show privacy risks of using private data in the retrieval dataset for RAG and propose a DP solution for adapting text-to-image diffusion models to sensitive target domains. Our approach holds the potential of democratizing the adoption of such models in privacy-sensitive applications since it does not require any costly fine-tuning, nor does it require sharing sensitive data with a fine-tuning party.

Contributions.

Our main contributions are the following.

  1. We demonstrate how retrieval-augmented diffusion models without any privacy protection can leak sample-level information from their retrieval dataset in the worst case.

  2. We propose a differentially private retrieval-augmented diffusion model (DP-RDM) that provably mitigates this information leakage. DP-RDM utilizes a DP retrieval mechanism based on private $k$-NN (Zhu et al., 2020) and adapts the existing retrieval-augmented diffusion model architecture (Blattmann et al., 2022) to this mechanism.

  3. We evaluate our DP-RDM on three datasets (CIFAR-10, MS-COCO and Shutterstock, a privately licensed dataset of 239M image-caption pairs) and show that it can effectively adapt to these datasets privately with minor loss in generation quality. On MS-COCO, our model is able to generate up to 10,000 images at a privacy cost of $\epsilon = 10$ while achieving an FID of 10.9 (lower is better). In comparison, using only the public retrieval dataset yields an FID of 14.4 with the same model.

2 Background

We first present the necessary background on diffusion models, retrieval-augmented generation and differential privacy.

2.1 Diffusion Models and Retrieval Augmented Generation

Diffusion models (Ho et al., 2020; Song et al., 2020) represent a major advance in generative modeling, yielding photorealistic images. During training, an image is corrupted with noise (the forward process) and fed to a model (a neural network) whose learning objective is to predict the noise in order to perform denoising (the reverse process). At generation time, the model produces an image sample from noise through iterative denoising. Although this denoising process can be carried out in the raw pixel space, a more effective approach is to perform it in the latent space of a pre-trained autoencoder, giving rise to the so-called latent diffusion models (LDMs; Rombach et al. (2022)). Another popular extension to diffusion models is text-to-image diffusion models (Rombach et al., 2022; Saharia et al., 2022), where the denoising process is conditioned on text captions from paired image-text datasets. In effect, the model learns to produce images that are relevant to a given text prompt.

Retrieval augmented generation (RAG). While text-to-image generative models have achieved remarkable results, one major drawback is their limited generation controllability. Retrieval augmentation (Sheynin et al., 2022; Chen et al., 2022; Blattmann et al., 2022; Yasunaga et al., 2022) is an emerging paradigm that provides more fine-grained control through the use of a retrieval dataset. In retrieval augmented generation (RAG), the model is trained to condition on both the text prompt and samples retrieved from the retrieval dataset, making it possible to steer generation by modifying the retrieved samples. Aside from controllable generation, RAG has additional benefits: (1) By scaling up the retrieval dataset, one can surpass the typical learning capacity limits of parametric models and obtain higher-quality samples without increasing the model size (Blattmann et al., 2022). (2) By using retrieval datasets from different image domains, the model can adapt to a new target domain without fine-tuning (Sheynin et al., 2022; Casanova et al., 2021). (3) It is easy to satisfy security and privacy requirements such as information flow control (Tiwari et al., 2023; Wutschitz et al., 2023) and right-to-be-forgotten (Ginart et al., 2019; Guo et al., 2019; Bourtoule et al., 2021) for samples in the retrieval dataset.

2.2 Differential Privacy

Throughout we will consider randomized mechanisms operating on a dataset $D$ in some data space $\mathcal{X}$. Two datasets $D$ and $D'$ are said to be neighboring, denoted $D \simeq D'$, if they differ in a single data point, through the addition, removal or replacement of some $x \in D$.

Definition 2.1 (DP; Dwork et al. (2006)).

Let $\epsilon \geq 0$ and $\delta \geq 0$. A randomized mechanism $M : \mathcal{X} \rightarrow \mathcal{O}$ is $(\epsilon, \delta)$-DP if for every pair of neighboring datasets $D \simeq D'$ and every subset $S \subseteq \mathcal{O}$, we have:

$$\Pr[M(D) \in S] \leq e^{\epsilon} \Pr[M(D') \in S] + \delta.$$

The privacy guarantee we offer relies on Rényi DP, a generalization of DP measured in terms of the Rényi divergence.

Definition 2.2 (Rényi divergence; Rényi (1961)).

Let $\alpha > 1$. The Rényi divergence of order $\alpha$ between two probability distributions $P$ and $Q$ on $\mathcal{X}$ is defined by:

$$\mathbb{D}_{\alpha}(P \,\|\, Q) \triangleq \frac{1}{\alpha - 1} \log \mathbb{E}_{o \sim Q}\left[\left(\frac{P(o)}{Q(o)}\right)^{\alpha}\right].$$

Definition 2.3 (Rényi DP; Mironov (2017)).

Let $\alpha > 1$ and $\epsilon \geq 0$. A randomized mechanism $M$ is $(\alpha, \epsilon)$-RDP if for all $D \simeq D'$, we have $\mathbb{D}_{\alpha}(M(D) \,\|\, M(D')) \leq \epsilon$.

A Rényi DP guarantee can be translated to an (ϵ,δ)italic-ϵ𝛿(\epsilon,\delta)( italic_ϵ , italic_δ )-DP guarantee through conversion (Balle et al., 2020). Rényi DP also benefits from simplified privacy composition, whereby the privacy budget for multiple invocations can be added together. Finally, Rényi DP inherits the post-processing property of DP, meaning that future computation over a private result cannot be made less private (Mironov, 2017).
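The two properties above, additive composition and conversion to $(\epsilon, \delta)$-DP, can be sketched in a few lines of Python. This is an illustrative sketch rather than the accounting used in this work: it applies the classical conversion from Mironov (2017), $\epsilon_{\mathrm{DP}} = \epsilon_{\mathrm{RDP}} + \log(1/\delta)/(\alpha - 1)$, which is simpler but looser than the conversion of Balle et al. (2020). The function names and any RDP values passed in are hypothetical.

```python
import math


def compose_rdp(per_query_eps, num_queries):
    """RDP composition at a fixed order alpha: the budgets simply add up."""
    return per_query_eps * num_queries


def rdp_to_dp(alpha, rdp_eps, delta):
    """Classical conversion (Mironov, 2017), used here as a looser stand-in:
    (alpha, eps)-RDP implies (eps + log(1/delta) / (alpha - 1), delta)-DP."""
    if alpha <= 1 or not (0.0 < delta < 1.0):
        raise ValueError("need alpha > 1 and 0 < delta < 1")
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)


def best_dp_epsilon(rdp_curve, num_queries, delta):
    """rdp_curve maps an order alpha to the per-query RDP epsilon at that order.
    Every order gives a valid bound, so we report the smallest one."""
    return min(
        rdp_to_dp(alpha, compose_rdp(eps, num_queries), delta)
        for alpha, eps in rdp_curve.items()
    )
```

Minimizing over several orders $\alpha$ is the usual way to obtain the final budget, because small orders pay a large conversion term while large orders pay a large composed RDP term.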

DP mechanisms.

In this work, we mainly consider the Sampled Gaussian Mechanism (SGM) which is an extension of the Gaussian Mechanism.

Definition 2.4 (Gaussian Mechanism).

Let $f : \mathcal{X} \to \mathbb{R}^d$ be a query with global sensitivity $\Delta = \sup_{D \simeq D'} \|f(D) - f(D')\|_2$ and $Z \sim \mathcal{N}(0, \sigma^2 I)$. The Gaussian Mechanism, defined as $M(D) = f(D) + Z$, satisfies $(\alpha, \epsilon)$-RDP with $\epsilon = \frac{\alpha \Delta^2}{2\sigma^2}$ for every $\alpha > 1$ (Mironov, 2017).
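As an illustration of Definition 2.4, the following sketch adds isotropic Gaussian noise to a vector-valued query result and evaluates the stated RDP bound. The helper names are hypothetical, and the caller is assumed to supply a valid global sensitivity $\Delta$ for the query.

```python
import numpy as np


def gaussian_mechanism(value, sigma, rng):
    """Release f(D) + Z with Z ~ N(0, sigma^2 I); `value` is the exact f(D)."""
    value = np.asarray(value, dtype=float)
    return value + rng.normal(0.0, sigma, size=value.shape)


def gaussian_rdp_epsilon(alpha, sensitivity, sigma):
    """RDP cost of the Gaussian Mechanism: alpha * Delta^2 / (2 * sigma^2)."""
    if alpha <= 1 or sigma <= 0:
        raise ValueError("need alpha > 1 and sigma > 0")
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)
```

Halving the sensitivity has the same effect on the bound as doubling $\sigma$, which is why averaging over more neighbors later allows less noise for the same budget.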

Given a large dataset, it is also possible to amplify the privacy guarantee by outputting a noisy answer on a subset of the dataset using SGM.

Definition 2.5 (SGM; Mironov et al. (2019)).

Let $f : \mathcal{X} \to \mathbb{R}^d$ be a function operating on subsets of $D$. The Sampled Gaussian Mechanism (SGM), with sampling rate $q \in (0, 1]$ and noise $\sigma > 0$, is defined by:

$$\text{SG}_{q,\sigma}(D) \triangleq f(\{\mathbf{x} : \mathbf{x} \in D \text{, sampled w.p. } q\}) + Z,$$

where each $\mathbf{x}$ is sampled independently with probability $q$ without replacement and $Z \sim \mathcal{N}(0, \sigma^2 I)$.

Theorem 2.6 (SGM satisfies RDP; Mironov et al. (2019)).

Let $\text{SG}_{q,\sigma}(D)$ be defined as in Definition 2.5 for function $f$. Then $\text{SG}_{q,\sigma}(D)$ satisfies $(\alpha, \epsilon)$-RDP with

$$\epsilon \leq \mathbb{D}_{\alpha}\left((1 - q)\,\mathcal{N}(0, \sigma^2) + q\,\mathcal{N}(1, \sigma^2) \,\|\, \mathcal{N}(0, \sigma^2)\right),$$

if $\|f(D) - f(D')\|_2 \leq 1$ for adjacent $D, D' \in \mathcal{S}$.
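For integer orders $\alpha \geq 2$, the divergence in Theorem 2.6 has a closed form, obtained by expanding the $\alpha$-th power of the density ratio with the binomial theorem: $\epsilon = \frac{1}{\alpha - 1} \log \sum_{i=0}^{\alpha} \binom{\alpha}{i} (1 - q)^{\alpha - i} q^{i} \exp\left(\frac{i^2 - i}{2\sigma^2}\right)$. The sketch below evaluates this expression in log space. It assumes unit sensitivity as in the theorem; the function name is hypothetical, and fractional orders are not handled.

```python
import math


def sgm_rdp_epsilon(alpha, q, sigma):
    """RDP epsilon of the Sampled Gaussian Mechanism with unit sensitivity,
    for an integer order alpha >= 2, sampling rate q in (0, 1], noise sigma > 0."""
    if not (isinstance(alpha, int) and alpha >= 2):
        raise ValueError("this sketch only supports integer alpha >= 2")
    if not (0.0 < q <= 1.0) or sigma <= 0:
        raise ValueError("need 0 < q <= 1 and sigma > 0")
    if q == 1.0:
        # No subsampling: reduces to the plain Gaussian Mechanism bound.
        return alpha / (2.0 * sigma ** 2)

    # log of each binomial term, kept in log space to avoid overflow.
    log_terms = [
        math.log(math.comb(alpha, i))
        + i * math.log(q)
        + (alpha - i) * math.log1p(-q)
        + (i * i - i) / (2.0 * sigma ** 2)
        for i in range(alpha + 1)
    ]
    # Numerically stable log-sum-exp.
    m = max(log_terms)
    log_a = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_a / (alpha - 1)
```

For $\alpha = 2$ the sum collapses to $1 + q^2 (e^{1/\sigma^2} - 1)$, which makes the amplification visible: the cost shrinks roughly quadratically in $q$ when $q$ is small.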

DP training/fine-tuning.

To apply differential privacy to machine learning, the quintessential algorithm is DP-SGD (Abadi et al., 2016)—a differentially private version of the SGD algorithm. At each iteration, instead of computing the average gradient on a batch of training samples, DP-SGD applies SGM to the sum of norm-clipped per-sample gradients. The resulting noisy gradient can be used directly in optimization, and the total privacy cost can be derived through composition and post-processing. Although DP-SGD appears straightforward at first, successful application of DP-SGD usually requires excessively large batch sizes to reduce the effect of noisy gradients (Li et al., 2021; De et al., 2022; Sander et al., 2023; Yu et al., 2023). This requirement drastically increases the computational cost of training, limiting the application of DP-SGD mostly to simple datasets and small models, or for fine-tuning a model pre-trained on public data (Cattan et al., 2022; Yu et al., 2021). In the context of DP generative modeling, DP-SGD has only been successfully applied to DP fine-tuning on simple datasets such as MNIST, CIFAR-10 and CelebA (Ghalebikesabi et al., 2023; Lyu et al., 2023). By contrast, our method achieves differential privacy on retrieval-augmented diffusion without DP fine-tuning.

3 Differentially Private RDM

(a) RDM architecture.
(b) Samples generated by RDM using an adversarial retrieval dataset.
Figure 2: (a) RDM architecture from Blattmann et al. (2022). (b) Samples generated with a non-private RDM. The retrieval dataset consists of blank images and one illustration of the Eiffel Tower with a Shutterstock watermark. Each row shows samples for a different number of retrieved neighbors $k$. The watermark is clearly visible even though it came only from conditioning on the retrieval dataset.

This section exemplifies the privacy risks in RDM, presents our differentially private retrieval-augmented diffusion model (DP-RDM) framework, details its technical implementation and introduces privacy guarantees.

3.1 Motivation: Privacy Risks in Retrieval Augmented Generation

To motivate our DP-RDM framework, we first show that RDM (Blattmann et al., 2022), like standard text-to-image diffusion models, is also vulnerable to sample memorization. The RDM architecture is illustrated in Figure 2(a). To generate an image sample, RDM encodes a text prompt using the CLIP text encoder (Radford et al., 2021) and retrieves $k$ nearest neighbors from the retrieval dataset, which contains CLIP-embedded images. The retrieved neighbors are then used as conditioning vectors in lieu of the text prompt for the diffusion model.

We consider information stored in the retrieval dataset to be private. To illustrate the memorization risk in RDMs, we craft an adversarial retrieval dataset that consists of an image of the Eiffel Tower with a Shutterstock watermark (see Figure 2(b)) and a number of blank images. We vary the number of retrieved neighbors $k$ and show image samples generated by the RDM in Figure 2(b). Note that the watermark is visible in the generated images even when $k = 6$, which is a clear indication of sample-level information leakage. While this experiment was constructed with worst-case assumptions, we argue that such assumptions are reasonable when the data coverage of the retrieval dataset is sparse, and the queries can be chosen adversarially.

3.2 DP-RDM Architecture

Although RDM fails to provide privacy protection in the worst case, its architecture is already amenable to fine-grained control of information leakage. Samples in the retrieval dataset leak information only through the retrieval mechanism, and influence on the generated sample can be attributed precisely to the retrieved conditioning vectors. Thus by building a “privacy boundary” around the retrieval mechanism and bounding the influence of each sample on the conditioning vector, we can ensure rigorous and quantifiable privacy protection through DP (see Section 3.4).

Figure 3: Text-to-image generation with DP-RDM using a private retrieval dataset. Yellow blocks refer to models trained on public datasets. The private $k$-NN block, denoted in pink, illustrates the privacy boundary between public and private data.

Input: LDM $\phi$; public retrieval dataset $U$; private retrieval dataset $D$; text prompt $\mathbf{y} \in \mathbb{R}^d$; prior $\xi \sim \mathcal{N}(0, I)$.

Input: noise parameter $\sigma > 0$; number of neighbors $k \in \mathbb{N}$; query interpolation $\lambda \in [0, 1]$; subsample rate $q \in [0, 1]$.

$k$-NN retrieval:

  $D' = \mathsf{RandomSubset}(D, q)$

  $N_k = \mathsf{NearestNeighbors}(D', \mathbf{y}, k)$

  $O_k = \mathsf{NearestNeighbors}(U, \mathbf{y}, k)$

Noisy aggregation:

  $\mathbf{z} = \frac{1}{k}\left(\sum_{j=1}^{k} N_k^{(j)}\right) + \mathcal{N}(0, \sigma^2 I)$

Query interpolation:

  $\mathbf{e} = (1 - \lambda) \cdot O_k + \lambda \cdot [\mathbf{z} | \cdots | \mathbf{z}]$

Return $\phi(\mathbf{e}, \xi)$

Algorithm 1: Generating Samples with DP-RDM.
Figure 4: Pseudo-code description of the private image generation procedure.

Figure 4 depicts the DP-RDM workflow, whose steps are summarized in Algorithm 1. The framework is based on RDM, which retrieves image embeddings from a database given an input text query, and uses the retrieved embeddings to condition the image generation process. We endow the retrieval augmentation system with a private $k$-NN retrieval module, which produces a privatized image embedding. This privatized image embedding may be interpolated with embeddings of public images retrieved using the text query. The resulting interpolated embedding is used to condition the reverse diffusion process. We detail the components of the private $k$-NN retrieval module in the following.

Retrieval dataset construction.

Let $D$ be a retrieval dataset constructed by encoding a private image dataset using a pre-trained CLIP image encoder (Radford et al., 2021) trained on a public dataset $U$. Each encoded image is a $d$-dimensional unit vector. Since the CLIP encoder was trained on a public retrieval dataset, we assume that it is available both at training and sampling time.

k-NN retrieval with random subset selection.

Given a text prompt, we first encode it into a $d$-dimensional unit vector $\mathbf{y}$ using the CLIP text encoder. Instead of querying $D$ directly, we obtain a subset $D'$ by subsampling each encoded image in $D$ independently with probability $q \in [0, 1]$. This is done to utilize privacy amplification via subsampling (Mironov et al., 2019). We then query $D'$ using $\mathbf{y}$ via max inner product search to obtain its $k$ nearest neighbor set $N_k \subseteq D'$.

Noisy aggregation.

The retrieved set $N_k$ must be privatized before use. Rather than conditioning on these retrieved vectors individually, we aggregate them by computing the mean as $\frac{1}{k} \sum_{j} N_k^{(j)}$. We then apply calibrated Gaussian noise with standard deviation $\sigma > 0$ to obtain the privatized embedding vector $\mathbf{z}$.

Query interpolation.

Since the training dataset $U$ for the RDM is public, we can use it in conjunction with the privatized embedding vector $\mathbf{z}$ for retrieval augmentation. That is, we retrieve the $k$ nearest neighbors $O_k$ from $U$ and combine them with $\mathbf{z}$ by repeating the vector $\mathbf{z}$ $k$ times, resulting in $\mathbf{e} = (1 - \lambda) \cdot O_k + \lambda \cdot [\mathbf{z} | \cdots | \mathbf{z}]$.

Note that the contribution from $D$ through $\mathbf{z}$ decreases as $\lambda$ decreases, which can help improve image quality when $\sigma$, from the noisy aggregation step, is large. In the extreme cases when $\lambda = 0$ or $\sigma = \infty$, no private information is leaked from $D$ and we rely entirely on $U$.
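The retrieval, noisy aggregation and interpolation steps of Algorithm 1 can be sketched with NumPy as follows. This is a minimal illustration, not the released implementation: the function name and array layout are assumptions, exact brute-force search stands in for whatever index is used in practice, and the LDM $\phi$ is left out, so the function stops at the conditioning matrix $\mathbf{e}$. One further assumption is labeled in the code: if fewer than $k$ private vectors survive subsampling, the sum is still divided by $k$, so that any single vector contributes at most $1/k$ in norm.

```python
import numpy as np


def dp_rdm_conditioning(private_emb, public_emb, query, k, sigma, lam, q, rng):
    """Build the conditioning matrix e of Algorithm 1 (all steps before the LDM).

    private_emb: (n, d) unit-norm embeddings of the private dataset D.
    public_emb:  (m, d) unit-norm embeddings of the public dataset U, m >= k.
    query:       (d,) unit-norm text embedding y.
    Returns an array of shape (k, d).
    """
    d = query.shape[0]

    # RandomSubset(D, q): keep each private vector independently w.p. q.
    keep = rng.random(private_emb.shape[0]) < q
    subset = private_emb[keep]

    # NearestNeighbors(D', y, k) via exact max inner product search.
    top_private = np.argsort(-(subset @ query))[:k]
    n_k = subset[top_private]

    # NearestNeighbors(U, y, k) on public data, which incurs no privacy cost.
    top_public = np.argsort(-(public_emb @ query))[:k]
    o_k = public_emb[top_public]

    # Noisy aggregation. Assumption of this sketch: always divide by k,
    # even when fewer than k private vectors were retrieved.
    z = n_k.sum(axis=0) / k + rng.normal(0.0, sigma, size=d)

    # Query interpolation: broadcasting repeats z across the k rows of O_k.
    return (1.0 - lam) * o_k + lam * z[None, :]
```

Setting `lam=0.0` returns the public neighbors unchanged, which matches the public-only extreme described above.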

3.3 Model Training

Although the RDM in Blattmann et al. (2022) can be readily used with the DP-RDM generation algorithm, we can drastically improve generation quality by making a minor change in the training algorithm of the RDM. Rather than conditioning on a set of retrieved embeddings from the public dataset, we can utilize the noisy aggregation module presented in Section 3.2 during training. Hence, for each training step $t$, isotropic Gaussian noise with a standard deviation drawn uniformly at random ($\sigma_t \sim \mathcal{U}(0, \sigma)$) is added to the aggregate embedding. This change helps the model learn to extract weak signal from the noisy conditioning vector $\mathbf{e}$ at sample generation time, thereby improving the sample quality substantially. We also substitute the CLIP encoder with the open-source MetaCLIP encoder (Xu et al., 2023). See Section 6 for further details.
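The training-time modification can be illustrated with the short sketch below. It only covers the construction of a noisy conditioning vector for one training example; the function name is hypothetical, and whether $\sigma_t$ is redrawn per step or per example, as well as how the result is fed to the denoiser, are assumptions of this sketch.

```python
import numpy as np


def noisy_training_condition(neighbors, sigma_max, rng):
    """Aggregate retrieved public embeddings and add noise of random scale.

    neighbors: (k, d) array of embeddings retrieved for one training example.
    sigma_max: upper end of the uniform range for the noise scale (sigma).
    Returns the noisy aggregate of shape (d,) and the sampled sigma_t.
    """
    # sigma_t ~ U(0, sigma): the model sees many noise levels during training.
    sigma_t = rng.uniform(0.0, sigma_max)
    mean = neighbors.mean(axis=0)
    noisy = mean + rng.normal(0.0, sigma_t, size=mean.shape)
    return noisy, sigma_t
```

Sampling the scale rather than fixing it exposes the model to conditioning vectors ranging from clean to heavily perturbed, so a single trained model can be used with different noise levels at generation time.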

3.4 Privacy Guarantee

In the theorem below, we show that running Algorithm 1 on a single query satisfies Rényi DP. As a result, one can leverage the composition theorem of Rényi DP (Mironov, 2017) and conversion theorem (Balle et al., 2020) to prove that handling $T$ queries using Algorithm 1 satisfies $(\epsilon, \delta)$-DP.

Theorem 3.1.

Algorithm 1 satisfies $(\alpha, \epsilon)$-RDP with

$$\epsilon = \mathbb{D}_{\alpha}\left(q\,\mathcal{N}(2/k, \sigma^2) + (1 - q)\,\mathcal{N}(0, \sigma^2) \,\|\, \mathcal{N}(0, \sigma^2)\right). \tag{1}$$
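To relate Equation (1) to Theorem 2.6, it may help to rescale the noise. The short derivation below is a reading aid, not part of the original statement: it assumes the aggregate has $\ell_2$ sensitivity at most $2/k$, and it uses only the fact that the Rényi divergence is invariant under an invertible change of variables, here $t \mapsto (k/2)\,t$. The symbol $\tilde{\sigma}$ is introduced for illustration.

```latex
\begin{align*}
\epsilon
 &= \mathbb{D}_{\alpha}\!\left(q\,\mathcal{N}(2/k, \sigma^2) + (1-q)\,\mathcal{N}(0, \sigma^2)
    \,\middle\|\, \mathcal{N}(0, \sigma^2)\right) \\
 &= \mathbb{D}_{\alpha}\!\left(q\,\mathcal{N}(1, \tilde{\sigma}^2) + (1-q)\,\mathcal{N}(0, \tilde{\sigma}^2)
    \,\middle\|\, \mathcal{N}(0, \tilde{\sigma}^2)\right),
 \qquad \tilde{\sigma} = \frac{k\sigma}{2}.
\end{align*}
```

In other words, Algorithm 1 behaves like an SGM with unit sensitivity and noise $k\sigma/2$; with $k = 18$ as in Figure 1, the effective noise is $9\sigma$. By additive composition, answering $T$ queries at order $\alpha$ then costs $(\alpha, T\epsilon)$-RDP.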
Proof.

We first prove the DP guarantee for replacement adjacency; the proof for leave-one-out adjacency is almost identical. Let $D_1$ and $D_2$ be neighboring datasets with $D_1 \setminus D_2 = \{\mathbf{x}_1\}$ and $D_2 \setminus D_1 = \{\mathbf{x}_2\}$ for some $\mathbf{x}_1, \mathbf{x}_2$. For any query $\mathbf{y}$, let $D_1'$ (resp. $D_2'$) be the Poisson subsampled subset of $D_1$ (resp. $D_2$) and $N_{1,k}$ (resp. $N_{2,k}$) be the $k$-NN set from Algorithm 1.

Assume that the sampling of D1superscriptsubscript𝐷1D_{1}^{\prime}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and D2superscriptsubscript𝐷2D_{2}^{\prime}italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT share the same randomness. Then, with probability 1q1𝑞1-q1 - italic_q we have that 𝐱1D1subscript𝐱1superscriptsubscript𝐷1\mathbf{x}_{1}\notin D_{1}^{\prime}bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∉ italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and 𝐱2D2subscript𝐱2superscriptsubscript𝐷2\mathbf{x}_{2}\notin D_{2}^{\prime}bold_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∉ italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, hence N1,k=N2,ksubscript𝑁1𝑘subscript𝑁2𝑘N_{1,k}=N_{2,k}italic_N start_POSTSUBSCRIPT 1 , italic_k end_POSTSUBSCRIPT = italic_N start_POSTSUBSCRIPT 2 , italic_k end_POSTSUBSCRIPT and the distribution of 𝐳𝐳\mathbf{z}bold_z is identical under both D1subscript𝐷1D_{1}italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and D2subscript𝐷2D_{2}italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. With probability q𝑞qitalic_q we have 𝐱1D1subscript𝐱1superscriptsubscript𝐷1\mathbf{x}_{1}\in D_{1}^{\prime}bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and 𝐱2D2subscript𝐱2superscriptsubscript𝐷2\mathbf{x}_{2}\in D_{2}^{\prime}bold_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. 
In this case, if $\mathbf{x}_1 \notin N_{1,k}$ and $\mathbf{x}_2 \notin N_{2,k}$ then $N_{1,k} = N_{2,k}$ and again the distribution of $\mathbf{z}$ is identical under both $D_1$ and $D_2$. Otherwise, without loss of generality assume that $\mathbf{x}_1 \in N_{1,k}$. Since $\|\mathbf{x}\|_2 = 1$ for all $\mathbf{x} \in D_1, D_2$, we have that:

\[
\left\| \frac{1}{k}\left(\sum_{j=1}^{k} N_{1,k}^{(j)}\right) - \frac{1}{k}\left(\sum_{j=1}^{k} N_{2,k}^{(j)}\right) \right\|_2 = \frac{1}{k}\|\mathbf{x}_1 - \mathbf{x}\|_2 \leq \frac{2}{k},
\]

for some $\mathbf{x} \in D_2$. Thus Alg. 1 is a subsampled Gaussian mechanism with $L_2$ sensitivity of $2/k$. Eq. 1 holds by Theorems 4 and 5 of Mironov et al. (2019). ∎

The statement of Theorem 3.1 gives the RDP parameter $\epsilon$ in terms of the Rényi divergence between a Gaussian and a mixture of Gaussians. In practice, this divergence can be computed numerically using popular packages such as Opacus (Yousefpour et al., 2021) and TensorFlow Privacy.
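As an illustration of what such a numerical computation involves, the following is a minimal standard-library sketch rather than the accounting code used in our experiments. It evaluates the integer-order closed form for the sampled Gaussian mechanism from Mironov et al. (2019), composes it over the queries, and converts to $(\epsilon, \delta)$-DP with the standard RDP conversion. The function names, the value of `delta`, the range of orders, and the assumption that noise of standard deviation $\sigma$ is added directly to the $k$-NN average (so that the effective noise multiplier is $\sigma / (2/k)$) are illustrative choices, not taken from Algorithm 1.

```python
import math


def _log_binom(n: int, i: int) -> float:
    """Natural log of the binomial coefficient C(n, i)."""
    return math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)


def rdp_sampled_gaussian(q: float, noise_multiplier: float, alpha: int) -> float:
    """RDP epsilon at integer order alpha >= 2 for one Poisson-subsampled Gaussian query.

    noise_multiplier is the noise standard deviation divided by the L2 sensitivity.
    """
    if q == 0.0:
        return 0.0
    if q == 1.0:
        # No subsampling: plain Gaussian mechanism.
        return alpha / (2.0 * noise_multiplier ** 2)
    # Log of each term of the binomial expansion, kept in log space to avoid overflow.
    log_terms = [
        _log_binom(alpha, i)
        + i * math.log(q)
        + (alpha - i) * math.log1p(-q)
        + (i * i - i) / (2.0 * noise_multiplier ** 2)
        for i in range(alpha + 1)
    ]
    top = max(log_terms)
    log_a = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return log_a / (alpha - 1)


def epsilon_after_queries(q, sigma, k, num_queries, delta, orders=range(2, 129)):
    """(epsilon, delta)-DP epsilon after num_queries queries, minimized over RDP orders."""
    # Assumption: noise of std sigma is added to the k-NN average, whose
    # L2 sensitivity under replacement adjacency is 2 / k.
    noise_multiplier = sigma / (2.0 / k)
    return min(
        num_queries * rdp_sampled_gaussian(q, noise_multiplier, a)
        + math.log(1.0 / delta) / (a - 1)
        for a in orders
    )
```

For example, `epsilon_after_queries(q=0.001, sigma=0.1, k=18, num_queries=1000, delta=1e-6)` returns the smallest bound over the tested orders; under this sketch, increasing `k` or decreasing `q` lowers the result.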

Privacy-utility analysis.

The privacy-utility trade-off of Alg. 1 depends on three factors: (1) the number of neighbors $k$, (2) the Gaussian noise magnitude $\sigma$, and (3) the quality of the retrieved embeddings $N_k$. Ideally, $k$ is large and all $k$ retrieved embeddings are relevant to the text query $\mathbf{y}$. This ensures that the privacy loss is low according to Theorem 3.1 and the generated image is faithful to the text query.

Figure 5: Privacy loss $\epsilon$ for generating 1,000 images of certain concepts with concept density $r$, in log-log scale. For concepts with high concept density, it is possible to generate a large number of high-quality images under low privacy cost.

To analyze this trade-off more concretely, for a given text query $\mathbf{y}$, we introduce the quantity $r \in [0,1]$ that represents the density of the concept encoded by $\mathbf{y}$ in the private retrieval dataset $D$. For example, the query “an image of the Eiffel Tower” likely has higher density than the query “an image of the Eiffel Tower from 1929”. Suppose $n \triangleq |D|$ and there are at least $rn$ samples in $D$ that are relevant to the query. Under a subsampling rate of $q$, about $rqn$ relevant samples remain in expectation, so for all $k$ neighbors to be relevant we must have $k \leq rqn$. For a fixed $\sigma$ and number of queries $T$, we can then minimize Eq. 1 over $k, q$ subject to this constraint to determine the minimum $\epsilon$ for answering $T$ queries faithfully. Intuitively, when $n$ increases, the feasible set of $(k, q)$ becomes larger and it becomes possible to attain a lower $\epsilon$.
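The constraint $k \leq rqn$ can be made concrete with a small sketch. The helper names below are hypothetical, and the code only enumerates feasible settings; it does not perform the minimization of Eq. 1. Exact rational arithmetic is used so that the floor is not affected by floating-point rounding.

```python
from fractions import Fraction
import math


def max_relevant_neighbors(r, q, n):
    """Largest integer k satisfying k <= r * q * n, computed exactly."""
    bound = Fraction(str(r)) * Fraction(str(q)) * n
    return math.floor(bound)


def feasible_settings(r, n, q_grid):
    """For each q in q_grid, the largest feasible k and the sensitivity 2 / k."""
    rows = []
    for q in q_grid:
        k = max_relevant_neighbors(r, q, n)
        if k >= 1:  # at least one relevant neighbor must survive subsampling
            rows.append((q, k, 2.0 / k))
    return rows
```

With $r = 10^{-3}$ and $q = 0.001$, a dataset of $n =$ 100M allows up to $k = 100$ relevant neighbors, whereas $n =$ 1M allows only $k = 1$, which illustrates why a larger retrieval dataset enlarges the feasible set.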

We simulate this trade-off curve for retrieval datasets of size $n =$ 1M, 10M, and 100M under a fixed $\sigma = 0.1$. Figure 5 shows the privacy loss $\epsilon$ for generating 1,000 images of a certain concept with density $r$. As density increases, $\epsilon$ drastically decreases as expected. At the same time, increasing the retrieval dataset size from 1M to 100M can reduce the privacy loss by as much as three orders of magnitude for rare concepts. Notably, for a common concept with density $r = 10^{-3}$, $\epsilon$ can be as low as 0.21 when $n =$ 100M. This analysis shows that the privacy-utility trade-off when generating images of common concepts using DP-RDM can be desirable, especially when scaling up the retrieval dataset.

4 Results

To demonstrate the efficacy of our DP-RDM algorithm, we evaluate its text-to-image generation performance on several large-scale image datasets.

4.1 Experiment Design

4.1.1 Datasets

Evaluation datasets.

We evaluate on three image datasets: CIFAR-10 (Krizhevsky and Hinton, 2009), MS-COCO 2014 (Lin et al., 2014) with face-blurring (Yang et al., 2022), and Shutterstock, a privately licensed dataset of 239M image-caption pairs. Both MS-COCO and Shutterstock contain pairs of images and corresponding text descriptions, and we evaluate by sampling validation image-text pairs and using the text description to perform text-to-image generation. For CIFAR-10, we perform class-conditional image generation using the text prompt “[label], high quality photograph”, where [label] is replaced by one of the CIFAR-10 classes.

Retrieval dataset.

We consider ImageNet (with face-blurring), MS-COCO (with face-blurring), and Shutterstock as retrieval datasets. We treat ImageNet as a public dataset both for training the RDM and for retrieval at generation time, while MS-COCO and Shutterstock serve as private retrieval datasets. Consequently, when the query interpolation parameter $\lambda = 0$, we leverage only public retrieval from ImageNet. Going forward, all references to MS-COCO refer to the 2014 version with face blurring.

4.1.2 Models and Baselines

We follow the training algorithm detailed in Blattmann et al. (2022) (with minor changes) to obtain two 400M parameter RDMs: RDM-fb and RDM-adapt. For RDM-fb, we use the face-blurred ImageNet dataset and MetaCLIP instead of CLIP. Note that MetaCLIP is also trained with face-blurring (Xu et al., 2023). For RDM-adapt, we further introduce noise in the retrieval mechanism during training as described in Section 3.3, which improves the retrieval mechanism’s robustness to added DP noise at generation time.

The model checkpoints above can be used in two ways: with public retrieval (PR; cf. Figure 2(a)) or with DP retrieval using private $k$-NN (DPR; cf. Figure 4). Applying the two retrieval methods to different retrieval datasets yields several methods/baselines:

  1. Public-only = (RDM-fb PR or RDM-adapt PR) + public retrieval dataset. This is a baseline that only utilizes the public retrieval dataset (i.e. ImageNet), which our DP-RDM method should consistently outperform.

  2. Non-private = (RDM-fb PR or RDM-adapt PR) + private retrieval dataset. This baseline uses public retrieval on a private dataset and is thus non-private. The performance of this baseline serves as a reference point that upper bounds the performance of DP retrieval methods.

  3. DP-RDM = (RDM-fb DPR or RDM-adapt DPR) + private retrieval dataset. This is our method applied to the two RDM models.

When using a private retrieval dataset at sample generation time, the models are evaluated using the method described in Algorithm 1. For our experiments, we fix the privacy budget $\epsilon$ and the query budget $T$, and evaluate a range of values for the number of neighbors $k$, the subsampling rate $q$, and the private noise $\sigma$. For experiments with RDM-adapt DPR, we fix $\sigma$ and use binary search to find $k$ (for fixed $q$) or $q$ (for fixed $k$). For each method, we generate 15k samples, one for each prompt from a hold-out set.
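The binary search over $k$ can be sketched as follows. This is an illustrative sketch under two assumptions that are not spelled out in the text: that the search targets the smallest $k$ meeting the budget, and that the privacy loss is non-increasing in $k$. The callable `epsilon_of_k` is a placeholder for whatever privacy accountant is used with $q$, $\sigma$ and $T$ held fixed.

```python
def smallest_k_within_budget(epsilon_of_k, target_epsilon, k_min=1, k_max=10_000):
    """Smallest integer k in [k_min, k_max] with epsilon_of_k(k) <= target_epsilon.

    Assumes epsilon_of_k is non-increasing in k. Returns None if even k_max
    exceeds the budget.
    """
    if epsilon_of_k(k_max) > target_epsilon:
        return None
    lo, hi = k_min, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if epsilon_of_k(mid) <= target_epsilon:
            hi = mid  # mid satisfies the budget; try smaller k
        else:
            lo = mid + 1  # mid is too costly; need more neighbors
    return lo
```

The same routine applies to the search over $q$ for fixed $k$ after discretizing $q$ and reversing the direction of monotonicity, since the privacy loss grows with the subsampling rate.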

4.1.3 Evaluation Metrics

We measure quality, diversity and prompt-generation consistency for generated images using the Fréchet inception distance (FID) score (Heusel et al., 2017), CLIPScore (Hessel et al., 2021), density and coverage (Naeem et al., 2020). The FID score measures the Fréchet distance between two image distributions by passing each image through an Inception V3 network (Szegedy et al., 2015). A lower FID score is better. The CLIPScore quantifies how close the generated samples are to the query prompt, where higher is better. Density and coverage are proxies for image quality and diversity (Naeem et al., 2020). Both of these metrics require fixing a neighborhood of $K$ samples and measuring the nearest-neighbor distance between samples, where $\text{NND}_K(R_i)$ is the distance between sample $R_i$ and its $K^{\text{th}}$ nearest neighbor. Let $B(w, u)$ be the ball in $\mathbb{R}^d$ around point $w$ with radius $u$. Given real samples $R$ and fake samples $F$, density for a fixed $K$ is defined as

\[
\text{density} \triangleq \frac{1}{KF}\sum_{j=1}^{F}\sum_{i=1}^{R}\mathbf{1}_{F_j \in B(R_i, \text{NND}_K(R_i))}.
\]

Note that it is possible for $\text{density} > 1$. Coverage measures the existence of a generated sample near a real sample and is bounded between 0 and 1. Coverage is defined as

\[
\text{coverage} \triangleq \frac{1}{R}\sum_{i=1}^{R}\mathbf{1}_{\exists\, j \text{ s.t. } F_j \in B(R_i, \text{NND}_K(R_i))}.
\]

We fix $K = 5$ when computing coverage and density in our experiments.
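To make the two definitions concrete, here is a small standard-library sketch that applies them directly to lists of feature vectors. It is not the evaluation code used for our results. The function names are illustrative, the ball is treated as closed, and the $K^{\text{th}}$ nearest neighbor of a real sample is taken among the other real samples; both are assumptions of this sketch.

```python
import math


def kth_nn_distance(points, index, K):
    """Distance from points[index] to its K-th nearest neighbor among the other points."""
    dists = sorted(
        math.dist(points[index], p) for j, p in enumerate(points) if j != index
    )
    return dists[K - 1]


def density_and_coverage(real, fake, K):
    """Density and coverage of fake samples with respect to real samples."""
    radii = [kth_nn_distance(real, i, K) for i in range(len(real))]
    # inside[i][j] is True when fake sample j lies in the ball around real sample i.
    inside = [
        [math.dist(r, f) <= radius for f in fake]
        for r, radius in zip(real, radii)
    ]
    density = sum(sum(row) for row in inside) / (K * len(fake))
    coverage = sum(any(row) for row in inside) / len(real)
    return density, coverage
```

For instance, with real samples `[(0, 0), (1, 0), (0, 1), (10, 10)]`, a single fake sample `(0.1, 0.1)` and `K = 1`, the fake sample falls inside three of the four balls, so density is 3 (greater than 1) while coverage is 0.75.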

4.2 Quantitative Results

Figure 6: FID score for images generated by DP-RDM under a privacy budget of $\epsilon = 10$. With both MS-COCO and Shutterstock retrieval datasets we observe a privacy gain for a wide range of number of private queries.
FID score.

We first demonstrate the performance of DP-RDM quantitatively in terms of the Fréchet inception distance (FID) score. Figure 6 shows the FID score for DP-RDM on MS-COCO and Shutterstock when answering $T$ queries privately for $T$ ranging from 1 to 10k. The evaluation dataset and the private retrieval dataset are the same, i.e. for Figure 6(a) we treat the MS-COCO validation set as the evaluation dataset and MS-COCO training set as the private retrieval dataset. We make the following observations.

  • DP-RDM achieves a desirable trade-off between model utility and privacy. The public-only (red line) and non-private (dotted blue line) baselines define the region of interest for the FID score of DP-RDM. Any region below the non-private line is unattainable, and any region above the public-only line provides no privacy-utility benefit compared to using only the public data. For most settings of the interpolation parameter $\lambda$ and number of queries $T$, the FID score of DP-RDM sits comfortably within this region and is closer to the non-private baseline. At a high level, Figure 6 suggests that at $\epsilon = 10$, DP-RDM can obtain a desirable privacy-utility trade-off even for as many as 10k queries.

  • Query interpolation helps improve model utility when answering a large number of queries. Query interpolation with $\lambda = 1$ uses only the retrieved aggregated embedding from the private retrieval dataset, which is injected with Gaussian noise to ensure differential privacy. As a result, FID suffers when answering a large number of queries. In comparison, by interpolating with the public retrieval dataset (i.e., $\lambda = 0.5, 0.8$), we see that DP-RDM retains a low FID even when answering as many as 10k queries.

  • Increasing the size of the retrieval dataset greatly improves model utility. The analysis in Figure 5 suggests that scaling up the retrieval dataset is crucial for maintaining a good privacy-utility trade-off. Our result validates this hypothesis concretely: the FID score for Shutterstock stays flat as $T$ ranges from 1 to 10k, whereas the FID score for MS-COCO increases sharply at $T =$ 10k for $\lambda = 0.8, 1$. Our DP-RDM architecture is amenable to the use of a large private retrieval dataset such as Shutterstock, whereas using such internet-scale high-fidelity datasets for DP training/finetuning of state-of-the-art generative models is out of reach at the moment.

See Section 7.2 for similar results with CIFAR-10, and Section 10 for detailed results with $\epsilon = \{5, 10, 20\}$.

Figure 7: Quantitative evaluation of DP-RDM on MS-COCO using other metrics: (a) CLIPScore, (b) Density, (c) Coverage. CLIPScore measures the degree of semantic alignment between the text prompt and the generated images, whereas density and coverage reflect the distribution-level similarity between images generated from validation prompts and ground-truth validation images. $\epsilon = \infty$ denotes the non-private baseline.
Other metrics.

In Figure 7 we plot the other quantitative metrics, namely CLIPScore, density and coverage, as a function of the number of queries for different values of $\epsilon$ on MS-COCO. See Figure 14 in Section 7.1 for corresponding plots for Shutterstock. CLIPScore reflects the degree of semantic alignment between the text prompt and the generated image, whereas density and coverage reflect the distribution-level similarity between images generated from validation prompts and ground-truth validation images. As expected, all three metrics decrease as the number of queries increases. For the coverage metric, the gap between DP-RDM and the non-private baseline is particularly large, which suggests that DP-RDM may be suffering from mode dropping.

4.3 Qualitative Results

In this section, we showcase prompts and privately generated image samples using RDM-adapt with DP retrieval. We sample 15k validation prompts from Shutterstock and MS-COCO and use RDM-adapt DPR with $\epsilon = 10, T = 100$ to generate corresponding images, and then select the top 12 for each dataset according to CLIPScore. Figure 8 shows the selected prompts and Figure 9 shows the generated images. Intriguingly, DP-RDM is able to generate images of concepts that are unknown to RDM-adapt at training time by utilizing the private retrieval dataset. For example, the prompt “abstract light background” (prompt 9 for Shutterstock) generates a surprisingly coherent image that is unlikely to appear in the ImageNet dataset. These results suggest that DP-RDM can successfully adapt to a new private image dataset under a strong DP guarantee. In Section 8.2 we also show the prompts and generated images with the lowest CLIPScore as failure cases.

Ref Validation Prompt
1 image of blue sky with white cloud for background usage.
2 wood texture paint yellow
3 School of Jackfish in Sipadan Malaysia
4 A bunch of garlic isolated on white background
5 Halloween pumpkins on wooden table on dark color background
6 cauliflower
7 Closeup of golden art sculpture pagoda and golden tiered umbrella in temple against with blue sky in north of Thailand. Faith and religion concept.
8 White stork hunting in a meadow
9 abstract light background
10 White grunge wood close-up background. Painted old wooden wall.
11 Multicolored abstract background holidays lights in motion blur image. illustration digital.
12 Torn blue jean fabric texture.
(a) Shutterstock.
Ref Validation Prompt
1 A dog watching a television that is displaying a picture of a dog.
2 The Asian girls are sharing conversation under a parasol.
3 An aerial view of several jumbled cars on a narrow road.
4 a short passenger train on a track by a grassy hill
5 A kitchen scene with focus on the microwave and oven.
6 An empty wooden bench near a large body of water.
7 A brown bear on two legs is near the water.
8 An adult and a baby elephant crossing a road.
9 A man rides a motorcycle on a track.
10 A bear that is in the grass in the wild.
11 A black-and-white photo of a large bench on the sidewalk.
12 A view of a sink and toilet in a bathroom in the dark.
(b) MS-COCO.
Figure 8: Text prompts for generated samples with high CLIPScore for RDM-adapt DPR.
Figure 9: Images generated using RDM-adapt DPR with privacy budget $\epsilon = 10$ under 100 queries. (a) Shutterstock ($k = 17, \lambda = 0.8, q = 0.001$). (b) MS-COCO ($k = 19, \lambda = 0.8, q = 0.01$).

4.4 Ablations

Effect of adapted RDM training.

In Section 3.3 we detailed changes we made to the RDM training pipeline to better adapt it to our DP-RDM algorithm. The most critical change is the addition of Gaussian noise in the aggregated embeddings during training, which helps the RDM to learn to extract weak signals from noisy embeddings. Figure 10(a) shows the FID score on MS-COCO when using the non-adapted RDM-fb model in DP-RDM. It is evident that without adaptation, the model is unable to generate coherent images when the number of queries $T$ increases.

Effect of the parameter $k$.

The number of retrieved neighbors $k$ plays a critical role in controlling the privacy-utility trade-off for DP-RDM. When $k$ is large, the privacy cost $\epsilon$ decreases at the expense of aggregating embeddings retrieved from irrelevant neighbors. In Figure 10(b) and Figure 10(c), we plot the histogram of the $L_2$ norms of the aggregated embeddings, computed on a fixed sample of MS-COCO validation prompts. Intuitively, a high $L_2$ norm indicates strong semantic coherence between the retrieved neighbors, hence the generated image is more likely to be faithful to the text prompt. Compared to MS-COCO retrieval in Figure 10(b), Shutterstock retrieval in Figure 10(c) is less susceptible to the negative effect of aggregating across a large number of neighbors, despite the validation prompts coming from MS-COCO validation. We suspect that this is primarily due to Shutterstock being $\approx 2{,}900$ times larger than the MS-COCO retrieval dataset, which partially explains the strong performance of Shutterstock in Figure 6.
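The intuition that the norm of the aggregated embedding tracks the coherence of the neighbors can be checked on toy inputs. The sketch below uses a hypothetical helper name and hand-picked two-dimensional unit vectors rather than real embeddings. It averages the vectors and returns the $L_2$ norm of the mean: identical unit vectors give a norm of 1, while $k$ mutually orthogonal unit vectors give $1/\sqrt{k}$.

```python
import math


def aggregated_norm(embeddings):
    """L2 norm of the mean of a list of equal-length vectors."""
    k = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(vec[d] for vec in embeddings) / k for d in range(dim)]
    return math.sqrt(sum(c * c for c in mean))
```

For example, `aggregated_norm([[1.0, 0.0], [1.0, 0.0]])` is 1, whereas replacing the second vector with the orthogonal `[0.0, 1.0]` lowers the norm to $1/\sqrt{2}$, mirroring the shift toward smaller norms as more irrelevant neighbors are aggregated.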

Figure 10: Ablation studies. (a) MS-COCO (RDM-fb DPR): FID score vs. number of queries when using DP-RDM with the RDM-fb model. Without adapting the RDM to noise in the retrieved embeddings at training time, the model is uncompetitive compared to the public-only baseline. (b) MS-COCO $L_2$ norm and (c) Shutterstock $L_2$ norm: histogram of the $L_2$ norm of aggregated embeddings across different validation prompts. As $k$ increases, more irrelevant neighbors are included in the aggregated embedding, which reduces the retrieved embedding’s $L_2$ norm.

5 Conclusion

We showcased DP-RDM, the first differentially private retrieval-augmented architecture for text-to-image generation. DP-RDM enables adaptation of a diffusion model trained on public data to a private domain without costly fine-tuning. By scaling up the retrieval dataset, DP-RDM can generate a large number of high-quality images (as many as 10k) under a fixed privacy budget, thereby advancing the state-of-the-art in DP image generation.

Limitations.

While our method achieves competitive results for DP image generation, there are some clear limitations and opportunities for future work.

  1. The current privacy analysis is based on worst-case assumptions on the query and retrieval dataset. Variants of DP such as individual-level DP (Feldman and Zrnic, 2021) offer more flexible privacy accounting on a sample-level basis, which is beneficial to DP-RDM as it can assign a different privacy budget for each sample and expend it based on the query (Zhu et al., 2023).

  2. RAG can easily satisfy requirements such as the right-to-be-forgotten for samples in its retrieval dataset. DP-RDM can leverage this property to guarantee both privacy and right-to-be-forgotten at the same time, although further care is needed to ensure that it interfaces well with formal data deletion definitions such as $(\epsilon, \delta)$-unlearning (Ginart et al., 2019; Guo et al., 2019).

  3. Our paper focuses on private RAG for images. Can a similar technique be successfully applied to retrieval-augmented language models (Lewis et al., 2020)?

References

  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • Balle et al. (2020) Borja Balle, Gilles Barthe, Marco Gaboardi, Justin Hsu, and Tetsuya Sato. Hypothesis testing interpretations and renyi differential privacy. In International Conference on Artificial Intelligence and Statistics, pages 2496–2506. PMLR, 2020.
  • Blattmann et al. (2022) Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, and Björn Ommer. Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems, 35:15309–15324, 2022.
  • Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 141–159. IEEE, 2021.
  • Carlini et al. (2023) Nicolas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramer, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In 32nd USENIX Security Symposium (USENIX Security 23), pages 5253–5270, 2023.
  • Casanova et al. (2021) Arantxa Casanova, Marlene Careil, Jakob Verbeek, Michal Drozdzal, and Adriana Romero. Instance-conditioned GAN. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • Cattan et al. (2022) Yannis Cattan, Christopher A Choquette-Choo, Nicolas Papernot, and Abhradeep Thakurta. Fine-tuning with differential privacy necessitates an additional hyperparameter search. arXiv preprint arXiv:2210.02156, 2022.
  • Chen et al. (2022) Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022.
  • Chrabaszcz et al. (2017) Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
  • De et al. (2022) Soham De, Leonard Berrada, Jamie Hayes, Samuel L Smith, and Borja Balle. Unlocking high-accuracy differentially private image classification through scale. arXiv preprint arXiv:2204.13650, 2022.
  • Dockhorn et al. (2022) Tim Dockhorn, Tianshi Cao, Arash Vahdat, and Karsten Kreis. Differentially private diffusion models. arXiv preprint arXiv:2210.09929, 2022.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006. Proceedings 3, pages 265–284. Springer, 2006.
  • Feldman and Zrnic (2021) Vitaly Feldman and Tijana Zrnic. Individual privacy accounting via a renyi filter. Advances in Neural Information Processing Systems, 34:28080–28091, 2021.
  • Ghalebikesabi et al. (2023) Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert Stanforth, Jamie Hayes, Soham De, Samuel L Smith, Olivia Wiles, and Borja Balle. Differentially private diffusion models generate useful synthetic images. arXiv preprint arXiv:2302.13861, 2023.
  • Ginart et al. (2019) Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. Advances in neural information processing systems, 32, 2019.
  • Guo et al. (2019) Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
  • Harder et al. (2023) Frederik Harder, Milad Jalali, Danica J Sutherland, and Mijung Park. Pre-trained perceptual features improve differentially private image generation. Transactions on Machine Learning Research, 2023.
  • Hessel et al. (2021) Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Jegou et al. (2010) Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2010.
  • Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Ontario, 2009.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Li et al. (2021) Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. Large language models can be strong differentially private learners. arXiv preprint arXiv:2110.05679, 2021.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Lyu et al. (2023) Saiyue Lyu, Margarita Vinaroz, Michael F Liu, and Mijung Park. Differentially private latent diffusion models. arXiv preprint arXiv:2305.15759, 2023.
  • Mehta et al. (2022) Harsh Mehta, Abhradeep Thakurta, Alexey Kurakin, and Ashok Cutkosky. Large scale transfer learning for differentially private image classification. arXiv preprint arXiv:2205.02973, 2022.
  • Mironov (2017) Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th computer security foundations symposium (CSF), pages 263–275. IEEE, 2017.
  • Mironov et al. (2019) Ilya Mironov, Kunal Talwar, and Li Zhang. Rényi differential privacy of the sampled gaussian mechanism. arXiv preprint arXiv:1908.10530, 2019.
  • Naeem et al. (2020) Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, pages 7176–7185. PMLR, 2020.
  • Parmar et al. (2022) Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Rényi (1961) Alfréd Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, volume 4, pages 547–562. University of California Press, 1961.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • Sander et al. (2023) Tom Sander, Pierre Stock, and Alexandre Sablayrolles. TAN without a burn: Scaling laws of DP-SGD. In International Conference on Machine Learning, pages 29937–29949. PMLR, 2023.
  • Sheynin et al. (2022) Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. KNN-Diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022.
  • Somepalli et al. (2023) Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Diffusion art or digital forgery? Investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6048–6058, 2023.
  • Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • Tiwari et al. (2023) Trishita Tiwari, Suchin Gururangan, Chuan Guo, Weizhe Hua, Sanjay Kariyappa, Udit Gupta, Wenjie Xiong, Kiwan Maeng, Hsien-Hsin S Lee, and G Edward Suh. Information flow control in machine learning through modular model architecture. arXiv preprint arXiv:2306.03235, 2023.
  • Wutschitz et al. (2023) Lukas Wutschitz, Boris Köpf, Andrew Paverd, Saravan Rajmohan, Ahmed Salem, Shruti Tople, Santiago Zanella-Béguelin, Menglin Xia, and Victor Rühle. Rethinking privacy in machine learning pipelines from an information flow control perspective. arXiv preprint arXiv:2311.15792, 2023.
  • Xu et al. (2023) Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data. arXiv preprint arXiv:2309.16671, 2023.
  • Yang et al. (2022) Kaiyu Yang, Jacqueline Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky. A study of face obfuscation in ImageNet. In International Conference on Machine Learning (ICML), 2022.
  • Yasunaga et al. (2022) Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Retrieval-augmented multimodal language modeling. arXiv preprint arXiv:2211.12561, 2022.
  • Yousefpour et al. (2021) Ashkan Yousefpour, Igor Shilov, Alexandre Sablayrolles, Davide Testuggine, Karthik Prasad, Mani Malek, John Nguyen, Sayan Ghosh, Akash Bharadwaj, Jessica Zhao, et al. Opacus: User-friendly differential privacy library in PyTorch. arXiv preprint arXiv:2109.12298, 2021.
  • Yu et al. (2021) Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, et al. Differentially private fine-tuning of language models. In International Conference on Learning Representations, 2021.
  • Yu et al. (2023) Yaodong Yu, Maziar Sanjabi, Yi Ma, Kamalika Chaudhuri, and Chuan Guo. ViP: A differentially private foundation model for computer vision. arXiv preprint arXiv:2306.08842, 2023.
  • Zhu et al. (2020) Yuqing Zhu, Xiang Yu, Manmohan Chandraker, and Yu-Xiang Wang. Private-kNN: Practical differential privacy for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11854–11862, 2020.
  • Zhu et al. (2023) Yuqing Zhu, Xuandong Zhao, Chuan Guo, and Yu-Xiang Wang. "Private prediction strikes back!" Private kernelized nearest neighbors with individual Rényi filter. arXiv preprint arXiv:2306.07381, 2023.

6 Appendix: Training Details

In Figure 11 we detail the training procedure.

We train on a face-blurred version of ImageNet (Krizhevsky et al., 2012; Yang et al., 2022) to enhance the privacy of the pre-training dataset. Although face blurring is not exhaustive and in some cases blurs non-human subjects, it effectively mitigates memorization of faces by the trained model. Both models (RDM-fb PR and RDM-adapt PR) were trained using single center-cropped patches from ImageNet FB.

Figure 11: DP-RDM follows a typical training process for text-to-image generative models. The training process can be done entirely with public data.

DP-RDM supports two aggregation methods. RDM-fb PR instantiates the Concatenation module, which is the standard approach for training RDM (see Blattmann et al. (2022) for details). Alternatively, DP-RDM can be trained using the same Noisy Aggregation approach that is used during private sample generation. Because the global sensitivity is bounded at sample generation time, we normalize embeddings during training so that training matches the normalization procedure used when samples are generated.
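
To make the normalization and Noisy Aggregation steps concrete, the following minimal Python sketch averages unit-norm embeddings and perturbs the result with Gaussian noise. It assumes that the aggregate is a plain mean followed by isotropic noise $N(0, \sigma^2 I)$. The names `l2_normalize` and `noisy_aggregate` and the toy dimensions are hypothetical and are not taken from the released code.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def noisy_aggregate(neighbors, sigma, rng):
    """Average unit-norm embeddings, then add isotropic Gaussian noise.

    Assumption of this sketch: aggregation is a plain mean followed by
    N(0, sigma^2 I) noise on every coordinate.
    """
    unit = [l2_normalize(v) for v in neighbors]
    k, d = len(unit), len(unit[0])
    mean = [sum(v[j] for v in unit) / k for j in range(d)]
    return [m + rng.gauss(0.0, sigma) for m in mean]


# Toy example: k = 4 neighbors with d = 8 dimensions (illustrative sizes only).
rng = random.Random(0)
neighbors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
clean = noisy_aggregate(neighbors, sigma=0.0, rng=rng)
noisy = noisy_aggregate(neighbors, sigma=0.05, rng=rng)
```

Since every input has unit norm, the noiseless mean has norm at most 1, so the same noise scale can be applied during training and during generation.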

                    RDM (Blattmann et al., 2022)   RDM-fb PR                  RDM-adapt PR
Steps               120k                           110k                       140k
Batch               1280                           3840                       3840
Encoder             CLIP ViT-B/32                  MetaCLIP ViT-B/16 (400M)   MetaCLIP ViT-B/16 (400M)
Train Dataset       ImageNet                       ImageNet Face-Blurred      ImageNet Face-Blurred
Added Noise ($\sigma$)   0                         0                          0.05
Aggregation         No                             No                         Yes
Figure 12: Training Setup. Experiments were performed on models trained with face-blurring (FB) in the CLIP encoder, the conditioning and the training dataset.

Picking a Noise Magnitude for Training.

Note that all embeddings are normalized, so the norm of the aggregated embedding is at most 1. We therefore consider the highest amount of training noise such that $\sigma\cdot\sqrt{d}\approx 1$. When training RDM-adapt PR, $d=512$ and $\sigma=0.05$.
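
The arithmetic behind this choice can be checked with a short Python sketch. It uses the fact that a sample from $N(0, \sigma^2 I_d)$ has root-mean-square norm $\sigma\sqrt{d}$; the helper names are hypothetical.

```python
import math


def rms_noise_norm(sigma, dim):
    """Root-mean-square L2 norm of a sample from N(0, sigma^2 I_dim)."""
    return sigma * math.sqrt(dim)


def sigma_for_unit_noise(dim):
    """Noise scale at which the RMS noise norm equals exactly 1."""
    return 1.0 / math.sqrt(dim)


print(round(rms_noise_norm(0.05, 512), 3))   # 1.131
print(round(sigma_for_unit_noise(512), 4))   # 0.0442
```

With $d=512$, the value $\sigma=0.05$ therefore gives a noise norm of roughly 1.13, which is on the same scale as the unit bound on the aggregated embedding.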

Efficient subsampling and nearest neighbor search.

For large-scale retrieval datasets with millions of samples, it is necessary to adopt fast nearest neighbor search for computational efficiency. To this end, we use the Faiss library (Johnson et al., 2019), which is known to support fast inner product search for billion-scale vector databases. To support subsampling, we use the selector array functionality (IDSelectorBatch), which allows one to specify a binary index vector that limits the inner product search to a subset of the full retrieval dataset. We apply product quantization (Jegou et al., 2010) to the vector database in the case of Shutterstock. Note that product quantization requires training as well, which can also leak private information if done on the retrieval dataset itself. For retrieval from CIFAR-10, ImageNet FB, MS-COCO FB and the adversarial dataset, we use a flat index without index training. The Shutterstock index is built using an inverted file index and product quantization (IVFPQ) using an L2 quantizer, 8,192 centroids, and a code size of 256.
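
The interaction between subsampling and inner product search can be illustrated with a brute-force Python sketch that stands in for the Faiss index. It assumes that each database row is kept independently with probability $q$ (Poisson subsampling) and that the search is then restricted to the kept rows. The function `subsampled_top_k` and the toy sizes are hypothetical, and no Faiss calls are shown.

```python
import random


def inner_product(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def subsampled_top_k(query, database, k, q, rng):
    """Return (neighbor ids, kept ids) for a search over a random subset.

    Each row is kept independently with probability q (an assumption of
    this sketch); only kept rows are eligible as neighbors.
    """
    kept = [i for i in range(len(database)) if rng.random() < q]
    ranked = sorted(kept, key=lambda i: inner_product(query, database[i]), reverse=True)
    return ranked[:k], kept


# Toy database: 1,000 vectors of dimension 16 (illustrative sizes only).
rng = random.Random(0)
database = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(1000)]
query = [rng.gauss(0.0, 1.0) for _ in range(16)]
neighbor_ids, kept_ids = subsampled_top_k(query, database, k=5, q=0.1, rng=rng)
```

A fresh subset should be drawn for every query, since the amplification argument relies on the randomness of the subset being independent across queries.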

Privacy Analysis.

In Figure 13 we show how DP-RDM is calibrated across a range of neighbors and noise magnitudes. The mechanism can be tuned in terms of the subsampling rate $q$, the number of neighbors $k$, and the additive noise $\sigma$.

Figure 13: DP-RDM privacy analysis for a 100M dataset, over 10k fixed queries using privacy amplification by subsampling.
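
To give a feel for how $k$, $\sigma$ and the number of queries interact, the following Python sketch composes Rényi DP guarantees of the plain Gaussian mechanism (Mironov, 2017) and converts them to $(\epsilon, \delta)$-DP. It is not the accountant used for Figure 13: it ignores privacy amplification by subsampling, so its bounds are looser than the calibrated values. It also assumes that replacing one unit-norm embedding changes the mean of $k$ embeddings by at most $2/k$ in L2 norm. The function names are hypothetical.

```python
import math


def gaussian_rdp(alpha, sensitivity, sigma):
    """Renyi DP of order alpha for the Gaussian mechanism (Mironov, 2017)."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


def epsilon_without_subsampling(sigma, k, num_queries, delta):
    """(epsilon, delta)-DP bound after num_queries composed Gaussian releases.

    Assumptions of this sketch: L2 sensitivity 2 / k for the mean of k
    unit-norm embeddings, and no amplification by subsampling (q = 1).
    """
    sensitivity = 2.0 / k
    best = math.inf
    for step in range(1, 1000):
        alpha = 1.0 + step / 10.0  # grid of Renyi orders in (1, 101)
        rdp = num_queries * gaussian_rdp(alpha, sensitivity, sigma)
        eps = rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best


eps_k20 = epsilon_without_subsampling(sigma=0.05, k=20, num_queries=1, delta=1e-5)
eps_k40 = epsilon_without_subsampling(sigma=0.05, k=40, num_queries=1, delta=1e-5)
eps_10_queries = epsilon_without_subsampling(sigma=0.05, k=20, num_queries=10, delta=1e-5)
```

Under these assumptions, doubling $k$ halves the sensitivity and lowers $\epsilon$, while answering more queries raises it. This is the trade-off that the subsampling rate $q$ is used to relax.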

7 Further Experiment Details

All experiments were performed with 100 DDIM steps. ImageNet FB, Shutterstock and MS-COCO samples were generated using a classifier-free guidance scale of 2.0. CIFAR-10 experiments were done with the guidance scale set to 1.25. We resize and center crop each image before encoding it and storing the normalized embedding. To evaluate FID, we use the work of Parmar et al. (2022).
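
For readers unfamiliar with the guidance scale, the following Python sketch shows the standard classifier-free guidance combination of an unconditional and a conditional noise prediction. It is a generic illustration that assumes the common form $\hat{\epsilon} = \epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$, not code from our sampler, and the function name is hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine noise predictions: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions for a 2-dimensional latent (illustrative values only).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 0.5]
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
print(guided)  # [2.0, 0.0]
```

A scale of 1 recovers the conditional prediction, and larger scales such as 2.0 push the sample further toward the conditioning signal.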

7.1 Shutterstock Metrics

(a) CLIPScore
(b) Density
(c) Coverage
Figure 14: Shutterstock Density and Coverage plots over a range of privacy parameters. The dashed line ($\epsilon=\infty$) corresponds to the non-private baseline.

7.2 CIFAR-10

CIFAR-10 images are $32\times 32$ pixels. We therefore up-sample them to match our models. We found that random subsampling of the retrieval dataset (with $q=1/100$) significantly improved the quality and diversity of the generated images. We generate 30k samples and compare them against pre-computed statistics. At privacy loss $\epsilon=10$, or with no privacy ($\epsilon=\infty$), we report the best quality score over the choice of $k$ on CIFAR-10 for the 30,000 generated samples. For private diffusion models, there are two notable fine-tuning results. Harder et al. (2023) reduce the dependency on private data by using a public feature extractor to train a private generator on CIFAR-10, and report an FID of 26.8. Another fine-tuning result comes from Ghalebikesabi et al. (2023), who report an FID of 9.8 on CIFAR-10. However, both of these methods train on a downsampled version of ImageNet (Chrabaszcz et al., 2017), which limits their capacity to generate high-quality samples.

Model (CIFAR-10)   $T$      $\epsilon$   FID     $\sigma$   $k$   $\lambda$   $q$
RDM-adapt PR       0        $\infty$     15.26   0.05       4     1.0         0.01
RDM-adapt DPR      1        10           13.91   0.05       13    1.0         0.01
RDM-adapt DPR      10       10           14.26   0.05       16    1.0         0.01
RDM-adapt DPR      100      10           14.34   0.05       19    1.0         0.01
RDM-adapt DPR      1,000    10           14.52   0.05       23    1.0         0.01
RDM-adapt DPR      10,000   10           15.19   0.05       33    1.0         0.01
Figure 15: CIFAR-10: Tabular Results
Figure 16: For $\epsilon=10$, we compare the privacy trade-off for RDM-adapt DPR and CIFAR-10 retrieval.

8 Appendix: More Generated Samples

Figure 17: RDM-adapt DPR samples generated for $\epsilon=10$, $\delta=1/n$, where $n=50{,}000$ denotes the size of the retrieval dataset (CIFAR-10 train). Here the number of neighbors is $k=23$, and the Sub-sampled Gaussian Mechanism is calibrated for 1,000 private queries over random subsets with $q=0.01$.

8.1 ImageNet (Face Blurred) Results

When $\lambda=0$, the privacy loss is $\epsilon=0$ and the model relies only on the public retrieval dataset.

(a) Shutterstock Validation
(b) MS-COCO Validation
Ref Validation Prompt
1 Female Ptarmigan Stood.
2 Yorkshire terrier with a red bow
3 Bowl of coffee beans , isolated on white background.
4 Cute Siberian Husky puppy with pink flowers
5 Stork stands guarding the nestlings in his nest. Beautiful Moon in the background - Banya, Bulgaria
6 Ancient Bon stupa in Saldang village, Nepal. Saldang lies in Nankhang Valley, the most populous of the sparsely populated valleys making up the culturally Tibetan region of Dolpo.
7 Turkish coffee pots made in a traditional style
8 Human hand holding a big snail shell
9 Rugby American football ball isolated on white background. This has clipping path.
10 abstract orange. Summer background with a magnificent sun burst with lens flare. space for your message
11 colored disposable knives lie on a pink and blue background
12 Closeup of golden art sculpture pagoda and golden tiered umbrella in temple against with blue sky in north of Thailand. Faith and religion concept.
(a) RDM-adapt DPR with Shutterstock.
Ref Validation Prompt
1 A white toilet sitting next to a toilet paper roller.
2 A street sign reads both in English and Asian script.
3 A dog inspects an open dishwasher in a kitchen.
4 A bathroom sink has a running facet and bamboo plant.
5 A bathroom with blue walls and white tiles.
6 A teddy bear sitting on a chair inside a room.
7 A computer desk with an open umbrella hooked up to wires.
8 A young baby girl standing in a driveway holding an open umbrella.
9 A bus that is sitting in the grass near a street.
10 A stuffed animal pushed into a toilet.
11 A metal bench sits on a path in the forest.
12 Black and white photograph of a bowl of apples.
(b) RDM-adapt DPR with MS-COCO.

8.2 Failure Cases

In the following figures and tables we show examples of generated samples with the lowest CLIPScore when compared to their validation prompt.

Figure 20: Shutterstock
Figure 21: MS-COCO
Ref Validation Prompt
1 Beautiful Seamless Pattern of Watercolor Summer Garden Blooming Flowers on purple background
2 Hello august calligraphy and set of cute and necessary things for relaxing on the beach, isolated on white background.
3 strawberry milk splash isolated on white background
4 Merry Christmas line flat design card with holidays symbols - Santa Claus, Christmas tree, house, mistletoe
5 Illustration of card with bowknot with round frame for text isolated on white background. Black friday sale banner template. Design for Christmas sale, shopping, retail, discount poster
6 Men’s classic shoes isolated on white background
(a) Failure cases with Shutterstock.
Ref Validation Prompt
1 An old plane flying through the cloudy sky.
2 A person sitting by the water with a bike.
3 A man in a neon green vest is holding a stop sign on the newly paved road.
4 Tire marks carve paths down a snow covered street.
5 An elderly man wearing a white button down shirt and a black bow tie smiling.
6 A number of red and green traffic lights on a wide highway.
(b) Failure cases with MS-COCO.

9 Appendix: RDM and DP-RDM Compared

Here we show how RDM-adapt DPR uses noisy aggregation to keep individual samples of an adversarially chosen retrieval dataset from appearing in the generated images.

(a) Adversarial $D$ with RDM
(b) Adversarial $D$ with RDM-adapt DPR

10 Appendix: Quantitative Results

Model          Validation     FID     CLIPScore   Coverage   Density
RDM-adapt PR   CIFAR-10       40.87   -           -          -
RDM-adapt PR   MS-COCO FB     14.40   0.294       0.85       1.15
RDM-adapt PR   Shutterstock   24.32   0.269       0.79       0.95
Figure 24: ImageNet Face Blurred: Tabular Results
Model (MS-COCO FB)   $T$       $\epsilon$   FID      CLIPScore   Coverage   Density   $\sigma$   $k$   $\lambda$   $q$
RDM-adapt PR         0         $\infty$     10.08    0.267       0.88       1.4       0.0        4     1.0         1.0
RDM-adapt DPR        1         5            11.14    0.26        0.86       1.13      0.05       23    1.0         0.05
RDM-adapt DPR        1         10           11.35    0.273       0.88       1.25      0.04       21    1.0         0.1277
RDM-adapt DPR        1         20           10.68    0.265       0.88       1.14      0.05       10    1.0         0.05
RDM-adapt DPR        10        5            11.32    0.259       0.86       1.13      0.05       28    1.0         0.05
RDM-adapt DPR        10        10           11.36    0.251       0.85       0.97      0.065      16    1.0         0.05
RDM-adapt DPR        10        20           10.96    0.263       0.88       1.13      0.05       15    1.0         0.05
RDM-adapt DPR        100       5            11.57    0.257       0.85       1.1       0.05       37    1.0         0.05
RDM-adapt DPR        100       10           11.75    0.249       0.83       0.99      0.065      21    1.0         0.05
RDM-adapt DPR        100       20           11.1     0.261       0.86       1.12      0.05       21    1.0         0.05
RDM-adapt DPR        1,000     5            15.5     0.24        0.75       0.88      0.05       29    1.0         0.01
RDM-adapt DPR        1,000     10           12.14    0.246       0.82       0.94      0.065      34    1.0         0.05
RDM-adapt DPR        1,000     20           11.26    0.258       0.85       1.11      0.05       32    1.0         0.05
RDM-adapt DPR        10,000    5            20.01    0.218       0.66       0.66      0.08       30    1.0         0.01
RDM-adapt DPR        10,000    10           16.68    0.237       0.73       0.83      0.05       34    1.0         0.01
RDM-adapt DPR        10,000    20           14.61    0.242       0.77       0.91      0.05       26    1.0         0.01
RDM-fb               1         10           17.9     0.232       0.94       0.78      0.03       22    1.0         0.01
RDM-fb               10        10           20.86    0.227       0.88       0.62      0.035      22    1.0         0.01
RDM-fb               100       10           21.42    0.227       0.6        0.61      0.033      28    1.0         0.01
RDM-fb               1,000     10           27.07    0.22        0.49       0.4       0.041      28    1.0         0.01
RDM-fb               10,000    10           42.01    0.207       0.3        0.18      0.059      28    1.0         0.01
RDM-fb               100,000   10           155.67   0.192       0.06       0.02      0.164      22    1.0         0.01
Figure 25: MS-COCO Face Blurred: Tabular Results
Model (Shutterstock)   $T$      $\epsilon$   FID     CLIPScore   Coverage   Density   $\sigma$   $k$   $\lambda$   $q$
RDM-adapt PR           0        $\infty$     20.09   0.245       0.77       0.71      0.05       4     1.0         1.0
RDM-adapt DPR          1        5            20.77   0.256       0.8        0.84      0.08       16    0.5         0.01
RDM-adapt DPR          1        10           20.73   0.244       0.77       0.73      0.065      14    0.8         0.01
RDM-adapt DPR          1        20           20.95   0.262       0.8        0.87      0.06       10    0.5         0.01
RDM-adapt DPR          10       5            21.19   0.255       0.8        0.83      0.08       17    0.5         0.01
RDM-adapt DPR          10       10           20.93   0.26        0.8        0.85      0.065      16    0.5         0.01
RDM-adapt DPR          10       20           21.05   0.262       0.8        0.88      0.06       12    0.5         0.01
RDM-adapt DPR          100      5            21.28   0.247       0.78       0.79      0.06       24    0.8         0.01
RDM-adapt DPR          100      10           20.99   0.256       0.8        0.84      0.08       14    0.5         0.01
RDM-adapt DPR          100      20           20.86   0.244       0.78       0.74      0.065      13    0.8         0.01
RDM-adapt DPR          1,000    5            21.02   0.247       0.78       0.75      0.06       28    0.8         0.01
RDM-adapt DPR          1,000    10           20.96   0.26        0.81       0.88      0.065      21    0.5         0.01
RDM-adapt DPR          1,000    20           21.09   0.244       0.77       0.74      0.065      16    0.8         0.01
RDM-adapt DPR          10,000   5            22.06   0.234       0.76       0.73      0.08       36    0.8         0.01
RDM-adapt DPR          10,000   10           21.27   0.263       0.81       0.88      0.05       38    0.5         0.01
RDM-adapt DPR          10,000   20           21.44   0.263       0.81       0.88      0.05       29    0.5         0.01
Figure 26: Shutterstock: Tabular Results