License: arXiv.org perpetual non-exclusive license
arXiv:2305.11351v2 [cs.LG] 20 Feb 2024

Data Redaction from Conditional Generative Models

1st Zhifeng Kong Computer Science and Engineering
University of California San Diego
La Jolla, USA
[email protected]
   2nd Kamalika Chaudhuri Computer Science and Engineering
University of California San Diego
La Jolla, USA
[email protected]
Abstract

Deep generative models are known to produce undesirable samples such as harmful content. Traditional mitigation methods include re-training from scratch, filtering, or editing; however, these are either computationally expensive or can be circumvented by third parties. In this paper, we take a different approach and study how to post-edit an already-trained conditional generative model so that it redacts certain conditionals that will, with high probability, lead to undesirable content. This is done by distilling the conditioning network in the models, giving a solution that is effective, efficient, controllable, and universal for a class of deep generative models. We conduct experiments on redacting prompts in text-to-image models and redacting voices in text-to-speech models. Our method is computationally light and leads to better redaction quality and robustness than baseline methods while still retaining high generation quality.

I Introduction

Deep generative models are unsupervised deep learning models that learn a data distribution from samples and then generate new samples from it. These models have shown tremendous success in many domains such as image generation (rombach2021highresolution; ramesh2021zero; ramesh2022hierarchical; sauer2023stylegan), audio synthesis (kong2021diffwave; lee2023bigvgan), and text generation (gpt4; touvron2023llama). Most modern deep generative models are conditional: the user inputs some context known as the conditionals, and the model generates samples conditioned on the context.

However, as these models have grown more powerful, there has been increasing concern about their trustworthiness: in certain situations, these models produce undesirable outputs. For example, with text-to-image models, one may craft a prompt that contains offensive, biased, malignant, or fabricated content, and generate a high-resolution image that visualizes the prompt (nichol2021glide; birhane2021multimodal; schuhmann2022laion; ramesh2022hierarchical; rando2022red; nudenet; man). With speech synthesis models, one may easily turn text into celebrity voices (Betker2022TTS; wang2023neural; zhang2023speak). Text generation models can emit offensive, biased, or toxic content (pitsilis2018detecting; wallace2019universal; mcguffie2020radicalization; gehman2020realtoxicityprompts; abid2021persistent; perez2022red; schramowski2022large).

One plausible solution to mitigate this problem is to remove all undesirable samples from the training set and re-train the model. This is too computationally heavy for modern, large models. Another solution is to apply a classifier that filters out undesirable conditionals or outputs (rando2022red; nudenet; man), or to edit the outputs and remove the undesirable content after generation (schramowski2022safe). However, in cases where the model owners share the model weights with third parties, they do not have control over whether the filters or editing methods will be used. In order to prevent undesirable outputs more efficiently and reliably, we propose to post-edit the weights of a pre-trained model, which we call data redaction.

The first challenge is how to frame data redaction for conditional generative models. Prior work in data redaction for unconditional generative models considered this problem in the space of outputs, and framed the problem as learning the data distribution restricted to a valid subset of outputs (kong2023data). However, a conditional generative model learns a collection of distributions, one for each conditional and usually infinitely many, all of which are induced by networks that share weights; therefore, we cannot apply this method one by one for every conditional we would like to redact. In this paper, we frame data redaction for conditional generative models as redacting a set of conditionals that will very likely lead to undesirable content. In particular, we do redaction in the conditional space, instead of separately redacting samples generated from each conditional in the output space.

This statistical machine learning framework inspires us to design a universal, efficient, and effective method for data redaction. We only re-train (or distill) the conditioning part of the network by projecting redacted conditionals onto different, non-redacted reference conditionals. It is computationally light because everything except the conditioning network is fixed, and we only need to load a small fraction of the dataset for training.

We show there exists an explicit data redaction formula for simple class-conditional models. For more complicated generative models in real-world applications, we introduce a series of techniques to effectively redact certain conditionals but retain high generation quality. These include model-specific distillation losses and training schemes, methods to increase the capacity of the student conditioning network, ways to improve efficiency, and a few others.

We test our data redaction method on two real-world applications: GAN-based text-to-image (zhu2019dm) and diffusion-based text-to-speech (kong2021diffwave). For text-to-image, we redact prompts that include certain words or phrases. Our method has significantly better redaction quality and robustness than baseline methods while retaining generation quality similar to that of the pre-trained model. For text-to-speech, we redact certain voices outside the training set. Our method achieves both high redaction quality and high speech quality. Audio samples can be found on our demo website (https://dataredact2023.github.io/). Our methods for both applications are extremely computationally efficient: redacting text-to-image models takes approximately 0.5 hours, and redacting text-to-speech models takes less than 4 hours, both on a single NVIDIA 3080 GPU. In contrast, training the text-to-image model takes more than a day on one GPU, and training the text-to-speech model takes 2-3 days on 8 GPUs. This demonstrates that data redaction can be done significantly more efficiently than re-training full models from scratch.

Figure 1: Redact "white belly" from text-to-image models (zhu2019dm). Panels: (a) Pre-trained, (b) Reference, (c) Our Redaction, (d) Rewriting (bau2020rewriting). The prompt is "this bird has feathers that are black and has a white belly". (a) Sample generated from the pre-trained model, which produces a visualization of the prompt. (b) The target sample that redacts "white belly" but keeps the other concepts. (c) Generated sample from our redaction model, which aims to redact "white belly" and approximates the reference sample. (d) Sample generated from the Rewriting baseline, which is blurry and has lower quality. More samples can be found in Appendix B-B.

I-A Related Work

Machine Unlearning. Machine unlearning computes or approximates a re-trained machine learning model after removing certain training samples (cao2015towards). Many unlearning methods have been proposed for supervised learning (guo2019certified; schelter2020amnesia; neel2021descent; sekhari2021remember; izzo2021approximate; ullah2021machine; bourtoule2021machine; warnecke2021machine), among which some provide theoretically guaranteed unlearning or removal for strictly convex classifiers. There is one approximate deletion method for generative models (kong2022approximate), which aims to delete from an unconditional generative model by post-hoc rejection sampling. The goal of data redaction is very different from that of machine unlearning: machine unlearning unlearns training samples and is usually studied in the privacy context, while data redaction prevents undesirable samples from being generated regardless of whether they are in the training set. [Footnote 1: It is also unclear how to do efficient unlearning for complex conditional generative models (e.g. text-to-X) because it is unclear what exact combination of (text, X) pairs to unlearn and how to do it beyond retraining from scratch. A detailed explanation can be found in Section II-C in (kong2023data).]

Data Filtering and Semantic Editing. A direct way to prevent certain samples from being generated is to apply a data filter (e.g., a malicious content classifier). The filter can be applied to training data before training (nichol2021glide; schuhmann2022laion; ramesh2022hierarchical), or applied post-hoc to model outputs (rando2022red; nudenet; man). Another line of research has looked at semantically modifying the outputs of generative models. For GANs (goodfellow2014generative), (bau2020semantic) computes an editing vector in the latent space to alter a semantic concept. For diffusion models (ho2020denoising), especially text-to-image models like Stable Diffusion (rombach2021highresolution), there are also a number of image editing techniques (bar2022text2live; hertz2022prompt; kawar2022imagic; valevski2022unitune; brack2022stable). (schramowski2022safe) applied image editing to prevent diffusion models from generating malicious images through a safety guidance term that alters the sampling algorithm for inappropriate prompts.

While these filtering and editing methods can be used to prevent malicious images, the model parameters are not modified. Consequently, in cases where the model owners share the model weights with third parties, they do not have control over whether the third parties will use the filters or editing methods. In contrast, our proposed method modifies the model weights to address this issue.

Data Redaction in Unconditional Models. Several works have studied methods to prevent generative models from producing undesirable samples, either by re-training or post-editing. For GANs, (asokan2020teaching) and (sinha2021negative) investigated re-training methods via modified loss functions that penalize generation of undesirable samples, and (bau2020rewriting) and (cherepkov2021navigating) introduced post-hoc parameter rewriting techniques for semantic editing, which can be used to remove undesirable artifacts. (kong2023data), (malnick2022taming), and (moon2023feature) designed post-editing data redaction methods for various types of pre-trained generative models.

All these methods are restricted to the unconditional setting as they modify the mapping from latent vectors to samples. In contrast, the goal of this paper is to redact data from pre-trained, conditional generative models. In these models, the conditional information heavily controls the content and style of generated samples (e.g. text-to-X), whereas the latent controls variation. It is therefore necessary to also modify the mapping from conditional to samples.

Redaction Methods for Stable Diffusion. (gandikota2023erasing) fine-tunes Stable Diffusion to incorporate negative guidance on undesirable visual styles (e.g., those under copyright protection). As a result, undesirable samples will not be generated with the standard sampling algorithm. However, one might recover the original score from the distilled score to break this method (see Appendix A for details). (gandikota2023unified) proposed an analytic solution for keys in cross attention blocks to edit concepts. A similar approach (zhang2023forget) proposed a re-steering mechanism for keys by minimizing the attention maps of target concepts. (heng2023selective) and (kumari2023ablating) proposed to forget or manipulate concepts by further fine-tuning the entire network with certain continual learning objectives. These methods are specifically designed for text-to-image tasks with Stable Diffusion. They require the model to be trained with either classifier-free guidance (ho2022classifier) or cross attention blocks, or they need to fine-tune the entire large diffusion network. In contrast, our proposed method is universal, applies to a broader range of generative models, and applies to multiple data domains. To our knowledge, it is the first method that is able to redact voices from a trained speech synthesis model.

II Preliminaries

Conditional Generative Models. Let $\mathcal{C}$ be the space of conditionals. It could be a finite set of discrete labels, or an infinite set of continuous representations. [Footnote 2: In cases where there are infinitely many discrete labels such as text or 16-bit floats, these conditionals are usually considered as continuous or transformed to continuous representations.] For any $c\in\mathcal{C}$ there is an underlying data distribution $p_{\mathrm{data}}(\cdot|c)$ (on $\mathbb{R}^{d}$) conditioned on $c$. In the discrete label case, this simply corresponds to a finite number of data distributions for all labels. In the more complicated continuous case, there is usually an underlying assumption that $p_{\mathrm{data}}(\cdot|c)$ is Lipschitz with respect to $c$: that is, $p_{\mathrm{data}}(\cdot|c)$ will not change much if $c$ does not change much.

Let $X=\{(x_{i},c_{i})\}$ be the set of training data, in which each $x_{i}$ is the sample and $c_{i}$ is the conditional (for example, $x_{i}$ is an image and $c_{i}$ is its caption). Let $G$ be a conditional generative model trained on $X$. $G$ has two inputs – a sample latent $z$ drawn from a Gaussian distribution and a conditional $c$ – and outputs sample $x=G(z|c)$. For each $c\in\mathcal{C}$, $G$ draws from a generative distribution $p_{G}(\cdot|c)$, which is trained to learn $p_{\mathrm{data}}(\cdot|c)$. In the discrete label case, this is equivalent to modeling a finite number of distributions. In the continuous case, $G$ also needs to generalize to unseen conditionals, because not all conditionals exist in the training set. We assume that $p_{G}(\cdot|c)$ learns $p_{\mathrm{data}}(\cdot|c)$ very well, as how to train these models is outside the scope of this paper.

Problem setup. Our goal is to redact a set of conditionals $\mathcal{C}_{\Omega}\subset\mathcal{C}$, referred to as the redaction conditionals, which with high probability lead to undesirable content. For example, for text-to-image models, we may be looking to redact text prompts related to violence or offensive content. [Footnote 3: This does not necessarily redact every possible offensive output; for example, an innocent prompt such as "a day in the park" might with very low probability result in a violent image, which our solution will not address.]

We assume that the redaction conditionals are given to us either as a set or described by a classifier. We assume that we are working with an already trained generative model $G$ and we are only allowed to post-edit it. Re-training generative models from scratch can be highly compute-intensive, and so our goal is to consider computationally efficient solutions. Additionally, we also want to avoid solutions that involve external filters, since a third party can choose not to use them. A final requirement of our solution is that it should retain high generation quality for the conditionals that are not to be redacted.

We assume that we have access to the parameters of the network $G$ and (part or whole of) its training dataset $X$. [Footnote 4: One setting in which this assumption holds is when the model owners want to make their model safer. We believe the only possible solution for a closed-source model is filtering.] The goal of this paper is to edit the parameters of model $G$ to form a new model $G'$ so that harmful conditionals lead to the generation of benign outputs.

Our proposed solution addresses this problem in the context where the conditioning networks are separate from the main generative network – which holds for most current network architectures – and achieves this by distilling only the conditioning networks.

III Method

In this section, we consider a special solution to our redaction task: for redacted conditionals $c\in\mathcal{C}_{\Omega}$, we let $G'$ learn the distribution conditioned on a different, non-redacted conditional $\hat{c}\in\mathcal{C}\setminus\mathcal{C}_{\Omega}$, which we denote as the reference conditional for $c$. Formally,

$$p_{G'}(\cdot|c)=p_{G}(\cdot|\hat{c})\text{ if }c\in\mathcal{C}_{\Omega}\text{, otherwise }p_{G}(\cdot|c). \tag{1}$$

Next, we introduce an efficient way to achieve (1). Let $H$ be the (separate) conditioning network in the generator network $G$. $H$ takes the conditional $c$ as input and computes conditional representation $H(c)$, which is then fused into the main generative network (potentially at different layers). Our solution is to project the conditional representation $H'(c)$ of the new conditioning network to $H(\hat{c})$:

$$H'(c)=H(\hat{c})\text{ if }c\in\mathcal{C}_{\Omega}\text{, otherwise }H(c). \tag{2}$$

We provide an illustrative and analytical explanation of our method in Section IV. Specifically, we show that (2) can be achieved explicitly if the model is conditioned on a few discrete labels and the conditioning network is affine. We then introduce methods for more complicated, real-world scenarios in Section V. For models conditioned on continuous representations and with complicated architectures, we introduce distillation-based methods to approximately achieve (2).
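To make the target behavior in (2) concrete, the following minimal sketch wraps an existing conditioning function in a lookup that redirects redacted conditionals to their references. It is only an illustration of the input-output behavior that the edited network $H'$ should have when the redaction set is finite; it is not the weight-editing or distillation procedure itself (an external wrapper could be removed by a third party, which is exactly what editing the weights avoids). The names `redact_conditioning`, `toy_H`, and the example labels are hypothetical.

```python
from typing import Callable, Dict, Hashable, TypeVar

C = TypeVar("C", bound=Hashable)  # type of a conditional (hypothetical)
R = TypeVar("R")                  # type of a conditional representation


def redact_conditioning(H: Callable[[C], R], reference: Dict[C, C]) -> Callable[[C], R]:
    """Build H' from H so that H'(c) = H(c_hat) if c is redacted, else H(c).

    `reference` maps each redacted conditional c to its reference c_hat.
    """
    # A reference conditional must itself be non-redacted.
    for c_hat in reference.values():
        if c_hat in reference:
            raise ValueError(f"reference {c_hat!r} is itself redacted")

    def H_prime(c: C) -> R:
        # Redacted conditionals are projected onto their reference.
        return H(reference.get(c, c))

    return H_prime


def toy_H(c: str) -> tuple:
    # Hypothetical stand-in for a conditioning network: a fixed "embedding".
    return (len(c), sum(map(ord, c)) % 97)


H_prime = redact_conditioning(toy_H, {"label_bad": "label_ok"})
```

In this toy setting, `H_prime("label_bad")` returns the representation of `"label_ok"`, while every other input passes through `toy_H` unchanged.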


IV Redacting Models Conditioned on Discrete Labels

In this section, we show that for simple class-conditional models, there is an explicit formula to redact certain labels.

Redacting a single label. Suppose there are $k$ labels: $\mathcal{C}=\{c_{1},\cdots,c_{k}\}$, where label $j$ is to be redacted. We consider a common type of conditioning method: each label $c_{i}$ is represented by a $k$-dimensional embedding vector $v_{i}\in\mathbb{R}^{k}$, and $H$ is an affine transformation whose output dimension is $r\geq k$. We assume the embedding vectors are linearly independent: $\mathbf{span}\{v_{1},\cdots,v_{k}\}=\mathbb{R}^{k}$. A special case of this formulation is the conditioning method proposed by (mirza2014conditional), where each $v_{i}=\mathbf{e}_{i}$ is the one-hot vector whose $i$-th element equals $1$, and is concatenated to the latent code.

Let $H(v)=Mv$, where $M\in\mathbb{R}^{r\times k}$. The redaction problem is equivalent to finding an $M'\in\mathbb{R}^{r\times k}$ such that $M'v_{i}=Mv_{i}$ for $i\neq j$ and $M'v_{j}=MV_{-j}\eta_{-j}$ for a one-hot vector $\eta_{-j}\in\mathbb{R}^{k-1}$, where $V_{-j}=[v_{1},\cdots,v_{j-1},v_{j+1},\cdots,v_{k}]\in\mathbb{R}^{k\times(k-1)}$.
The first condition $M'v_{i}=Mv_{i}$ for $i\neq j$ indicates that every row of $M'-M$ is in the null space of $\{v_{i}\}_{i\neq j}$, that is, orthogonal to every $v_{i}$ with $i\neq j$. This null space is a one-dimensional subspace with basis vector $u$. Then, $M'-M$ can be decomposed as $\omega u^{\top}$ for some $\omega\in\mathbb{R}^{r}$. Then, by the second condition $M'v_{j}=MV_{-j}\eta_{-j}$, we have $\omega=\frac{1}{u^{\top}v_{j}}M(V_{-j}\eta_{-j}-v_{j})$.
This means that by replacing $M$ with $M'=M(I+\frac{1}{u^{\top}v_{j}}(V_{-j}\eta_{-j}-v_{j})u^{\top})$, we are able to redact label $j$. When conditioned on $j$, the edited model will generate samples of another label (another digit, for instance), determined by which element of $\eta_{-j}$ is non-zero.
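The single-label formula can be checked numerically. The sketch below is a minimal NumPy illustration under stated assumptions: labels are 0-indexed, the columns of `V` are the embedding vectors $v_i$, and $u$ is obtained as the last left singular vector of $V_{-j}$ (one way to get a basis of the one-dimensional space orthogonal to the remaining embeddings). The function name `redact_single_label` and the random test matrices are hypothetical.

```python
import numpy as np


def redact_single_label(M: np.ndarray, V: np.ndarray, j: int, target: int) -> np.ndarray:
    """Return M' = M (I + (V_{-j} eta - v_j) u^T / (u^T v_j)).

    M: (r, k) conditioning matrix; V: (k, k) with columns v_i (0-indexed);
    j: label to redact; target: reference label (target != j).
    """
    k = V.shape[1]
    if target == j:
        raise ValueError("the reference label must differ from the redacted label")
    others = [i for i in range(k) if i != j]
    V_minus_j = V[:, others]                      # k x (k-1)
    # The last left singular vector is orthogonal to the column space of V_{-j}.
    left, _, _ = np.linalg.svd(V_minus_j, full_matrices=True)
    u = left[:, -1]
    eta = np.zeros(k - 1)                         # one-hot selector of the reference
    eta[others.index(target)] = 1.0
    correction = np.outer(V_minus_j @ eta - V[:, j], u) / (u @ V[:, j])
    return M @ (np.eye(k) + correction)


rng = np.random.default_rng(0)
k, r = 4, 6
V = np.eye(k) + 0.1 * rng.standard_normal((k, k))  # well-conditioned embeddings
M = rng.standard_normal((r, k))
M_new = redact_single_label(M, V, j=1, target=3)
```

Here `M_new @ V[:, 1]` should match `M @ V[:, 3]` up to floating-point error, while the representations of all other labels are left unchanged.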

Redacting multiple labels. Suppose there are multiple labels $\{1,\cdots,J\}$ ($J<k$) to be redacted. The $M'$ matrix needs to satisfy $M'v_{i}=Mv_{i}$ for $i>J$, and $M'v_{j}=MV_{-J}\eta_{-j}$ for $j\leq J$, where $V_{-J}=[v_{J+1},\cdots,v_{k}]\in\mathbb{R}^{k\times(k-J)}$. For $j\leq J$, let $u_{j}$ be the basis vector of the null space of $\{v_{i}\}_{i\neq j}$.
Each row of $M'-M$ is in the null space of $\{v_{i}\}_{i>J}$, which can be written as a linear combination of $\{u_{j}\}_{j=1}^{J}$. Therefore, we can represent $M'-M$ as

$$M'-M=\sum_{j=1}^{J}\omega_{j}u_{j}^{\top}=WU^{\top},$$

where the $j$-th column of $W$ is $\omega_{j}$ and the $j$-th column of $U$ is $u_{j}$. Let $V_{J}=[v_{1},\cdots,v_{J}]$ and $Y_{-J}=[\eta_{-1},\cdots,\eta_{-J}]$. We have $M'V_{J}=MV_{-J}Y_{-J}$. This simplifies to

$$WU^{\top}V_{J}=M(V_{-J}Y_{-J}-V_{J}).$$

Notice that $U^{\top}V_{J}$ is a diagonal matrix with $j$-th diagonal element $u_{j}^{\top}v_{j}\neq 0$. Therefore, we have

$$W=M(V_{-J}Y_{-J}-V_{J})(U^{\top}V_{J})^{-1}. \tag{3}$$
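Equation (3) can also be verified numerically. The following NumPy sketch is an illustration under stated assumptions rather than a reference implementation: labels are 0-indexed, the first $J$ labels are the redacted ones, and each $u_j$ is taken to be the $j$-th row of $V^{-1}$, which is one valid choice because $V^{-1}V=I$ makes that row orthogonal to every $v_i$ with $i\neq j$. The function name `redact_labels` and the test matrices are hypothetical.

```python
import numpy as np


def redact_labels(M: np.ndarray, V: np.ndarray, targets: list) -> np.ndarray:
    """Redact labels 0..J-1, sending label j to reference label targets[j].

    M: (r, k); V: (k, k) with columns v_i; each targets[j] lies in {J, ..., k-1}.
    Returns M' = M + W U^T with W given by equation (3).
    """
    k = V.shape[1]
    J = len(targets)
    V_J = V[:, :J]                        # redacted embeddings, k x J
    V_rest = V[:, J:]                     # V_{-J}, k x (k-J)
    Y = np.zeros((k - J, J))              # Y_{-J}: column j is the one-hot eta_{-j}
    for j, t in enumerate(targets):
        if not J <= t < k:
            raise ValueError("each reference label must be non-redacted")
        Y[t - J, j] = 1.0
    # Assumption: u_j is the j-th row of V^{-1}, so u_j^T v_i = 0 for i != j.
    U = np.linalg.inv(V).T[:, :J]         # k x J
    W = M @ (V_rest @ Y - V_J) @ np.linalg.inv(U.T @ V_J)
    return M + W @ U.T


rng = np.random.default_rng(1)
k, r = 5, 7
V = np.eye(k) + 0.1 * rng.standard_normal((k, k))
M = rng.standard_normal((r, k))
M_new = redact_labels(M, V, targets=[3, 2])  # label 0 -> label 3, label 1 -> label 2
```

With this choice of $u_j$, the matrix $U^{\top}V_{J}$ is the identity, so the inverse in (3) is trivial; it is kept in the code only to mirror the formula.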

Simplified formula for one-hot embedding vectors. Let $v_i=\mathbf{e}_i$ for each $i$. Then, we have $u_i=v_i=\mathbf{e}_i$, and therefore $U^{\top}V_J=I$. We also have $U=V_J=[I_J|\mathbf{0}]^{\top}$ and $V_{-J}=[\mathbf{0}|I_{k-J}]^{\top}$, where $I_J$ is the $J$-dimensional identity matrix. Then,

$$WU^{\top}=M([\mathbf{0}|I_{k-J}]^{\top}Y_{-J}-[I_J|\mathbf{0}]^{\top})[I_J|\mathbf{0}]=M\left(\begin{array}{cc}-I_J&\mathbf{0}\\ Y_{-J}&\mathbf{0}\end{array}\right).$$

As a result,

$$M'=M+WU^{\top}=M\left(\begin{array}{cc}\mathbf{0}&\mathbf{0}\\ Y_{-J}&I_{k-J}\end{array}\right).$$
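As a numerical sanity check of the one-hot case, the following minimal sketch builds $W$ from (3) and compares $M + WU^{\top}$ with the block form. The sizes, the random $M$ and $Y_{-J}$, and all variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, J = 6, 5, 2                       # output dim, number of labels, number of redacted labels (illustrative)

M = rng.normal(size=(d, k))             # stand-in for the pre-trained affine map
Y = rng.normal(size=(k - J, J))         # Y_{-J}: column j holds the coefficients eta_{-j}

I_k = np.eye(k)
V_J, V_rest = I_k[:, :J], I_k[:, J:]    # one-hot embeddings: redacted (k x J) and non-redacted (k x (k-J))
U = V_J                                 # u_i = v_i = e_i in the one-hot case

# Equation (3): W = M (V_{-J} Y_{-J} - V_J) (U^T V_J)^{-1}
W = M @ (V_rest @ Y - V_J) @ np.linalg.inv(U.T @ V_J)
M_new = M + W @ U.T

# Block form: M' = M [[0, 0], [Y_{-J}, I_{k-J}]]
block = np.block([[np.zeros((J, J)), np.zeros((J, k - J))],
                  [Y, np.eye(k - J)]])
```

Here `M_new` should agree with `M @ block` up to floating-point error, redacted embeddings are mapped to `M @ V_rest @ Y`, and non-redacted embeddings keep their original images under `M`.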

Higher embedding dimension. Because of linear independence, the null space of $\{v_i\}_{i\neq j}$ has one more dimension than the null space of $\{v_i\}_{i=1}^{k}$. Therefore, we can pick $u_j\in\mathbf{null}(\{v_i\}_{i\neq j})\setminus\mathbf{null}(\{v_i\}_{i=1}^{k})$.
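One concrete way to obtain such $u_j$, offered here as an illustrative choice rather than the paper's procedure, is to take the rows of the Moore-Penrose pseudo-inverse of $V=[v_1,\cdots,v_k]$. When $V$ has full column rank, $\mathrm{pinv}(V)\,V=I_k$, so row $j$ is orthogonal to every $v_i$ with $i\neq j$ and has unit inner product with $v_j$. The sizes and random matrices below are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, J, d = 8, 5, 2, 6                 # embedding dim m > number of labels k (illustrative sizes)

V = rng.normal(size=(m, k))             # columns v_1..v_k; linearly independent with probability 1
M = rng.normal(size=(d, m))             # stand-in for the pre-trained affine map
Y = rng.normal(size=(k - J, J))         # Y_{-J}

U_all = np.linalg.pinv(V).T             # column j is u_j, with u_j^T v_i = 1 if i == j else 0
U, V_J, V_rest = U_all[:, :J], V[:, :J], V[:, J:]

# Equation (3), followed by the update M' = M + W U^T
W = M @ (V_rest @ Y - V_J) @ np.linalg.inv(U.T @ V_J)
M_new = M + W @ U.T
```

With this choice $U^{\top}V_J$ is the identity, so the inverse in (3) is trivial; it is kept in the code only to mirror the formula.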


V Redacting Models Conditioned on Continuous Representations

In practice, the networks are usually complicated and highly non-linear. Therefore, there is generally no explicit formula to achieve (2) due to non-linearity and limited expressive power of the conditioning network. To approximately achieve (2), we propose to distill the conditioning network by minimizing

$$\begin{array}{rl}\min_{H'}~L(H';\lambda)=&\mathbb{E}_{c\in\mathcal{C}\setminus\mathcal{C}_{\Omega}}\|H'(c)-H(c)\|\\ &+\lambda\cdot\mathbb{E}_{c\in\mathcal{C}_{\Omega}}\|H'(c)-H(\hat{c})\|\end{array}\tag{4}$$

for some metric $\|\cdot\|$ and balancing coefficient $\lambda>0$. In the rest of this section, we study two types of common conditional generative models: image models conditioned on text prompts, and speech models conditioned on spectrogram representations. We will demonstrate specific losses and distillation techniques for each model that align with the slightly different goals in each task.
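A minimal sketch of a mini-batch estimate of (4) follows, assuming mean squared error as the metric and treating the conditioning networks as plain Python callables. The function name and argument names are hypothetical.

```python
import numpy as np

def distillation_loss(student, teacher, valid, redacted, references, lam=1.0):
    """Mini-batch estimate of L(H'; lambda) in (4), using mean squared error as the metric.

    student, teacher: callables standing in for H' and H
    valid:            batch of non-redacted conditionals c
    redacted:         batch of redacted conditionals c
    references:       the matching reference conditionals c_hat, same shape as `redacted`
    """
    # First term: the student should reproduce the teacher on non-redacted conditionals.
    keep = np.mean((student(valid) - teacher(valid)) ** 2)
    # Second term: on redacted conditionals the student should match the teacher at c_hat.
    redact = np.mean((student(redacted) - teacher(references)) ** 2)
    return keep + lam * redact
```

Only `student` would be trained; `teacher` stays frozen, so the second term is what pulls the redacted conditionals toward their references.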

V-A Redacting GAN-based Text-to-Image Models

In this section, we study how to redact text prompts in text-to-image models. Modern text-to-image models can produce high-resolution images conditioned on text prompts that may be offensive, biased, malignant, or fabricated (nichol2021glide, ; birhane2021multimodal, ; schuhmann2022laion, ; ramesh2022hierarchical, ; rando2022red, ; nudenet, ; man, ). These models are usually expensive to re-train, so it is important to redact these prompts without re-training.

Specifically, we look at DM-GAN (zhu2019dm, ), a GAN-based text-to-image model. It is trained on pairs of text and images from the CUB dataset (CUB, ; reed2016learning, ), a dataset of various bird species. DM-GAN is composed of three cascaded generative networks $\{G_1,G_2,G_3\}$. The first, $G_1$, generates $64\times 64$ images, the second, $G_2$, up-samples to $128\times 128$, and the third, $G_3$, up-samples to $256\times 256$. Each $G_i$ has its own conditioning network $H_i$. For a given prompt $c$, the model computes a sentence embedding $v_s(c)$ and word embeddings $v_w(c)$ from a pre-trained text encoder (xu2018attngan, ). The first conditioning network $H_1$ performs conditioning augmentation on the sentence embedding and concatenates the output to the latent variable. $H_2$ and $H_3$ apply memory writing modules to the word embeddings and fuse the outputs with the previously generated low-resolution images via several gates.

Defining $\hat{c}$. We assume $\mathcal{C}_{\Omega}$ contains prompts that have undesirable words or phrases. For these prompts, the reference prompts are defined by replacing these words with non-redacted ones.

Sequential distillation. We propose to distill the conditioning networks $\{H_1,H_2,H_3\}$ sequentially based on (4). This is because both $G_2$ and $G_3$ are generative super-sampling networks, which take the outputs of $G_1$ and $G_2$ as inputs, respectively. After $G_1$ is edited to $G_1'$ for redaction, $G_2$ will take the outputs of $G_1'$ as inputs, and similarly for $G_3$. Formally,

$$\begin{array}{rl}H_1'=\arg\min_{H_1'}&\{\mathbb{E}_{c\in\mathcal{C}\setminus\mathcal{C}_{\Omega}}\|H_1'(v_s(c))-H_1(v_s(c))\|\\ &+\lambda\cdot\mathbb{E}_{c\in\mathcal{C}_{\Omega}}\|H_1'(v_s(c))-H_1(v_s(\hat{c}))\|\},\end{array}\tag{5}$$

$$\begin{array}{rll}H_i'=\arg\min_{H_i'}&\{\mathbb{E}_{c\in\mathcal{C}\setminus\mathcal{C}_{\Omega},z}&\|H_i'(v_w(c),G_{i-1}'(z|c))-\\ &&\quad H_i(v_w(c),G_{i-1}'(z|c))\|\\ +&\lambda\cdot\mathbb{E}_{c\in\mathcal{C}_{\Omega},z}&\|H_i'(v_w(c),G_{i-1}'(z|c))-\\ &&\quad H_i(v_w(\hat{c}),G_{i-1}'(z|\hat{c}))\|\}\end{array}\tag{6}$$

for $i=2,3$.

Improved capacity. As $H_1'$ needs to approximate a piecewise function that is defined differently for two sets of sentence embeddings, we need to increase the capacity of $H_1'$ for better distillation. We add a few LSTM layers at the beginning of $H_1'$, which directly take the sentence embeddings as inputs. The LSTM layers are followed by a convolution layer that reduces the hidden dimension to 1. We initialize this layer with zero weights for training stability. We expect these layers to project the sentence embeddings of $c$ to those of $\hat{c}$. The rest of $H_1'$ has the same architecture as $H_1$ but all weights are initialized for training. We do not increase the capacity of $H_2'$ and $H_3'$ for two reasons. First, $H_1'$ has a more direct impact on the generated images because it directly controls the initial low-resolution image. Second, the memory writing modules of $H_2$ and $H_3$ are already very expressive.

Fixing the variance prediction part in $H_1$. We aim to reduce the computational overhead by fixing certain variables. The conditioning augmentation module in $H_1$ first computes a mean and a variance vector, and then samples from the Gaussian defined by them. We fix the variance prediction part and only distill the mean prediction part. In our experiments, the number of parameters to be trained in $H_1'$ (with improved capacity) is reduced by $\sim 32\%$ and therefore matches that of $H_1$.

$\lambda$ annealing. In order to make sure the distilled conditioning networks also approximate the pre-trained ones well for non-redacted prompts, we anneal the balancing coefficient $\lambda$ during distillation: we initialize $\lambda=\lambda_{\min}$ and linearly increase it to $\lambda_{\max}$ by the end.
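The linear schedule can be written as a small helper. This is a sketch under assumptions: the function name is hypothetical, the ramp is applied per training step, and the defaults reuse the $\lambda_{\min}=1$ and $\lambda_{\max}=3$ values reported in the experiments.

```python
def annealed_lambda(step, total_steps, lam_min=1.0, lam_max=3.0):
    """Linear ramp from lam_min at step 0 to lam_max at the final step."""
    if total_steps <= 1:
        return lam_max
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return lam_min + frac * (lam_max - lam_min)
```

Starting with a small $\lambda$ lets the student first match the teacher on non-redacted prompts before the redaction term gains weight.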

V-B Redacting Diffusion-based Text-to-Speech Models

Modern text-to-speech models can turn text into high-quality speech in unseen voices such as celebrity voices (kong2021diffwave, ; Betker2022TTS, ; wang2023neural, ; zhang2023speak, ). This may have unpredictable public impact if these models are used to fake celebrities. In this section, we study redacting certain voices from a pre-trained text-to-speech model.

Specifically, we look at DiffWave (kong2021diffwave, ), a diffusion probabilistic model that is conditioned on a spectrogram and outputs a waveform. It is trained on speech of a single female speaker reading a book, which we call the pre-trained voice. There are $n=30$ layers or residual blocks in DiffWave, each containing one independent conditioning network $H_i$. The architecture of each $H_i$ includes two up-sampling layers followed by one convolution layer.

Defining $\hat{c}$ with voice cloning. We assume $\mathcal{C}_{\Omega}$ contains a few clips of speech in a specific voice. We train a voice cloning model (CycleGAN-VC2 (kaneko2019cyclegan, )) between the specific and pre-trained voices, and then transform all clips in $\mathcal{C}_{\Omega}$ to the pre-trained voice. By doing this we obtain time-aligned pairs between $c\in\mathcal{C}_{\Omega}$ and the corresponding $\hat{c}$: when we select a small duration $[t,t+\Delta t]$, the content of $c_{t:t+\Delta t}$ is the same as that of $\hat{c}_{t:t+\Delta t}$, and only the voices differ.
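Time alignment means that a training pair can be formed by cropping the same window from a clip and its converted counterpart. The sketch below assumes both are stored as (channels, frames) spectrogram arrays of equal length; the function name and the array layout are hypothetical.

```python
import numpy as np

def sample_aligned_pair(c, c_hat, window, rng):
    """Crop the same time window [t, t + window) from a clip and its reference.

    c, c_hat: arrays of shape (channels, frames) with the same number of frames.
    """
    assert c.shape == c_hat.shape, "time-aligned clips must have equal shapes"
    # The upper bound is exclusive, so t + window never exceeds the number of frames.
    t = rng.integers(0, c.shape[1] - window + 1)
    return c[:, t:t + window], c_hat[:, t:t + window]
```

Speech re-synthesized from a transcript would not satisfy the equal-length assumption, which is why a voice conversion model is used to define $\hat{c}$.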

Improved voice cloning. We find the voice cloning quality of CycleGAN-VC2 can be improved by making the two unpaired training sets more similar. We first use a pre-trained Whisper model (radford2022robust, ) to extract text from the redacted speech. Then, we use Tortoise-TTS (Betker2022TTS, ) to turn this text into speech in the pre-trained voice. Note that this cannot be used to define $\hat{c}$ directly because the generated samples are not time-aligned with the speech to be redacted. However, these generated samples are more similar to the redacted samples because they have the same text, and therefore it is easier for CycleGAN-VC2 to learn transformations between these two voices.

Parallel distillation. We propose to distill all conditioning layers $H_i$ in parallel, as they are independent. We minimize the following loss: $\min\frac{1}{n}\sum_{i=1}^{n}L(H_i';\lambda).$

Fixing up-sampling layers in $H_i$. To reduce computation overhead, we fix the two up-sampling layers in each $H_i$. We only distill the last convolution layer in each $H_i$.

Improved capacity. To improve redaction quality, we increase the capacity of each $H_i'$ by replacing its last convolution layer $h_{\mathrm{conv}}$ with a spectrogram-rewriting module. It has two components: a gate $h_{\mathrm{gate}}$ consisting of a convolution with zero initialization followed by a sigmoid, and a transformation block $h_{\mathrm{trans}}$ consisting of two convolution layers. The forward computation of the spectrogram-rewriting module is defined as: $y=h_{\mathrm{conv}}(v)\odot h_{\mathrm{gate}}(v)+h_{\mathrm{conv}}(h_{\mathrm{trans}}(v))\odot(1-h_{\mathrm{gate}}(v)),$ where $v$ is the up-sampled mel-spectrogram and $y$ is the output representation at each layer. We expect this module to retain the pre-trained voice and also project redacted voices to the pre-trained voice.
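The forward computation can be sketched in NumPy as follows. Several details are assumptions made only to keep the example small: all convolutions are pointwise (kernel size 1), a ReLU sits between the two layers of $h_{\mathrm{trans}}$, the gate bias is zero, and the channel sizes and parameter names are hypothetical. Only the gating formula and the zero-initialized gate convolution come from the text.

```python
import numpy as np

def conv1x1(weight, bias, x):
    """Pointwise (kernel size 1) convolution on a (channels, frames) array."""
    return weight @ x + bias[:, None]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(c_in, c_out, c_hidden, rng):
    """Hypothetical parameter layout; the gate convolution starts at zero."""
    return {
        "conv_w": rng.normal(scale=0.1, size=(c_out, c_in)), "conv_b": np.zeros(c_out),
        "gate_w": np.zeros((c_out, c_in)), "gate_b": np.zeros(c_out),
        "t1_w": rng.normal(scale=0.1, size=(c_hidden, c_in)), "t1_b": np.zeros(c_hidden),
        "t2_w": rng.normal(scale=0.1, size=(c_in, c_hidden)), "t2_b": np.zeros(c_in),
    }

def spectrogram_rewriting(v, p):
    """y = h_conv(v) * gate(v) + h_conv(h_trans(v)) * (1 - gate(v)), element-wise."""
    h_conv = lambda x: conv1x1(p["conv_w"], p["conv_b"], x)
    gate = sigmoid(conv1x1(p["gate_w"], p["gate_b"], v))
    hidden = np.maximum(conv1x1(p["t1_w"], p["t1_b"], v), 0.0)  # ReLU is an assumption
    trans = conv1x1(p["t2_w"], p["t2_b"], hidden)               # h_trans(v), same channels as v
    return h_conv(v) * gate + h_conv(trans) * (1.0 - gate)

rng = np.random.default_rng(0)
params = init_params(c_in=80, c_out=128, c_hidden=80, rng=rng)
v = rng.normal(size=(80, 50))            # stand-in for an up-sampled mel-spectrogram
y = spectrogram_rewriting(v, params)     # shape (128, 50)
```

Under these assumptions the gate outputs sigmoid(0) = 0.5 everywhere at initialization, so the module starts as an equal blend of the two branches; a gate close to 1 recovers the original $h_{\mathrm{conv}}(v)$ path.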

Non-uniform distillation losses. We conjecture that the conditioning layers are not all equally important because of their order and their different hyper-parameters, specifically the dilation $2^{i \bmod n'}$ in the corresponding residual layer. This motivates us to use different weights and $\lambda$ values for each $H_i$: $\min\sum_{i=1}^{n}w_iL(H_i';\lambda_i).$ We test different schedules described in Table I.

TABLE I: Schedules for the non-uniform distillation losses.

| name | schedule |
|---|---|
| $w_i$-order | $w_i=\frac{1}{n}+\alpha(i-(n+1)/2)$ |
| $\lambda_i$-order | $\lambda_i=\lambda+\beta(i-(n+1)/2)$ |
| $w_i$-dilation | $w_i=\frac{1}{n}+\alpha(i \bmod n'-(n'+1)/6)$ |
| $\lambda_i$-dilation | $\lambda_i=\lambda+\beta(i \bmod n'-(n'+1)/6)$ |
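The schedules in Table I can be evaluated with two short helpers. This is a sketch: the function names are hypothetical, the index $i$ is taken to run from 1 to $n$ as in the sums above, and the values of $\alpha$, $\beta$, and $n'$ used below are made up for illustration because this section does not state them.

```python
def order_schedule(n, base, slope):
    """base + slope * (i - (n + 1) / 2) for i = 1..n (the "-order" rows of Table I)."""
    return [base + slope * (i - (n + 1) / 2) for i in range(1, n + 1)]

def dilation_schedule(n, n_cycle, base, slope):
    """base + slope * ((i mod n') - (n' + 1) / 6) for i = 1..n, with n_cycle playing the role of n'."""
    return [base + slope * ((i % n_cycle) - (n_cycle + 1) / 6) for i in range(1, n + 1)]

n = 30                                                       # number of conditioning layers
w_order = order_schedule(n, base=1 / n, slope=1e-3)          # alpha = 1e-3 is a made-up value
lam_order = order_schedule(n, base=1.0, slope=0.01)          # lambda = 1, beta = 0.01 are made-up values
w_dilation = dilation_schedule(n, n_cycle=10, base=1 / n, slope=1e-3)  # n' = 10 is an assumption
```

Because the offsets $i-(n+1)/2$ sum to zero over $i=1,\cdots,n$, the order-based weights still sum to 1; they stay positive as long as $|\alpha|(n-1)/2<1/n$.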

VI Experiments

In this section, we aim to answer the following questions. (1) Is the redaction method in Section IV able to fully redact labels? And (2) do the redaction algorithms in Section V redact certain conditionals well and retain high generation quality on real-world applications?

VI-A Redacting Models Conditioned on Discrete Labels

We train a class-conditional GAN called cGAN (mirza2014conditional, ) on MNIST (lecun2010mnist, ). Each conditional has a $10$-dimensional embedding vector, which is concatenated to the latent vector as the input. The affine transformation matrix $M$ in Section IV is the last 10 rows of the weight matrix of the first fully connected layer. We redact labels 0,1,2,3 according to (3), where we let $\hat{c}=9-c$ for them. Generated samples of pre-trained and redacted models are shown in Fig. 2.

Figure 2: Redacting labels 0,1,2,3 in cGAN on MNIST. Top: samples generated from the pre-trained model. Bottom: samples generated from the redacted model. Redacted conditionals (first two rows) are edited as expected, and other conditionals (last three rows) remain unchanged.
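If the label embeddings are one-hot (an assumption; this section only says they are 10-dimensional), redacting with $\hat{c}=9-c$ makes $Y_{-J}$ a selection matrix, and the update reduces to overwriting the weight rows of labels 0 to 3 with those of labels 9 to 6. The sketch below uses a random stand-in for the first fully connected layer, stored as inputs by outputs; the layer sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_labels, hidden = 100, 10, 32                    # illustrative sizes, not the paper's
fc_weight = rng.normal(size=(latent_dim + n_labels, hidden))  # stand-in for the first FC layer

redacted = [0, 1, 2, 3]
fc_redacted = fc_weight.copy()
for c in redacted:
    # The row of label c in the label block takes the row of its reference c_hat = 9 - c.
    fc_redacted[latent_dim + c] = fc_weight[latent_dim + (9 - c)]

def first_layer(weight, z, label):
    """Output of the first FC layer for latent z concatenated with a one-hot label."""
    one_hot = np.eye(n_labels)[label]
    return np.concatenate([z, one_hot]) @ weight
```

For a fixed latent `z`, `first_layer(fc_redacted, z, c)` then equals `first_layer(fc_weight, z, 9 - c)` for redacted labels and is unchanged for the others.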

VI-B Redacting GAN-based Text-to-Image Models

Setup. We use the pre-trained DM-GAN (zhu2019dm, ) model trained on the CUB dataset (CUB, ), which contains 8855 training images and 2933 testing images from 200 bird subcategories. Each image has 10 captions (reed2016learning, ). Our distillation algorithm is trained with the caption data only. We redact prompts that contain certain words or phrases. We redact the word "blue" $\in c$ by defining $\hat{c}$ as the prompt that replaces every "blue" with another word, "red" (footnote 5: any word other than "blue" can be used). Similarly, we redact "blue wings" and "red wings" by replacing these phrases with "white wings". We redact "long beak" and "white belly" by replacing the first with "short beak" and the second with "black belly". Finally, we redact "yellow" and "red" by replacing them with "black", which is more challenging as many samples are redacted.

Table II includes the number of training and test prompts that are redacted in each experiment. Note that when we redact "blue wings" and "red wings", we also redact the phrases "wings that are blue" and "wings that are red".

TABLE II: Number of redacted training and test prompts. There are 88550 training prompts and 29330 test prompts in total.

| Redaction prompts | # redacted training prompts | # redacted test prompts |
|---|---|---|
| long beak, white belly | 10377 | 3369 |
| blue / red wings | 732 | 303 |
| blue | 6113 | 2175 |
| yellow, red | 29514 | 9319 |

Architecture and optimization. The architecture of the pre-trained model and other details are in Appendix B. The architecture of student conditioning networks with improved capacity is shown in Fig. 4. For each $H_i$, $i=1,2,3$, we use the Adam optimizer (kingma2014adam, ) with a learning rate of $0.005$ to optimize the mean squared error loss. The redaction algorithm terminates after 1000 iterations. For $H_1$ we use a batch size of 128, and for $H_2$ and $H_3$ we reduce the batch size to 32 in order to fit into GPU memory.

Configurations. We first use the sequential distillation (5) and (6) with $\lambda=1$ to perform redaction, which we denote as the base configuration. We then improve the capacity by using a 3-layer bidirectional LSTM with hidden size $32$ and dropout rate $0.1$. Next, we fix the variance prediction in $H_1$ to reduce the number of parameters to optimize, which matches the base configuration. Finally, we apply $\lambda$ annealing by setting $\lambda_{\min}=1$ and $\lambda_{\max}=3$.

Baseline. We compare to the Rewriting algorithm (bau2020rewriting, ), a semantic editing method originally designed for unconditional generative models. We adapt their method to DM-GAN by rewriting $G_1$, $G_2$, and $G_3$ sequentially. For both $G_2$ and $G_3$ we rewrite the up-sampling layer before the feature output. For $G_1$ we have choices of rewriting the up-sampling layer at different resolutions ranging from $8\times 8$ to $64\times 64$. We test all these choices in the experiment.

Evaluation metrics. To evaluate the generation quality of $G'$, we compute Inception Scores (IS) (salimans2016improved, ) for images conditioned on redacted and valid prompts, separately. In detail, the IS scores are computed as $\exp(\mathbb{E}_{x}\mathbb{KL}(p(y|x)\parallel p(y)))$, where $x\sim p_{G'}(\cdot|c)$ for $c\sim\mathrm{Uniform}(\mathcal{C}_{\Omega})$ or $\mathrm{Uniform}(\mathcal{C}\setminus\mathcal{C}_{\Omega})$, $p(y|x)$ is the logit from the Inception-V3 output layer (szegedy2016rethinking, ), and $p(y)$ is the marginal. We generate one sample for each text prompt for evaluation.
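Given per-sample class distributions, the IS formula above takes only a few lines. The sketch assumes `probs` already holds class probabilities (for example, a softmax over the Inception-V3 outputs); running the network itself is out of scope here, and the function name and the `eps` smoothing constant are illustrative choices.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp(E_x KL(p(y|x) || p(y))) for a (num_samples, num_classes) matrix of class probabilities."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)          # p(y), averaged over the generated samples
    # KL divergence per sample; eps guards against log(0).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))
```

The score is 1 when every sample has the same predicted distribution and reaches the number of classes when predictions are confident and evenly spread across classes.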

To evaluate redaction quality, we compute the following three metrics, where $c\sim\mathcal{C}_{\Omega}$ and $z\sim\mathcal{N}$.

  1. $\mathcal{R}_{G(\cdot|c/\hat{c})}$ measures the faithfulness of $G'$ on the redaction prompts. It is defined as the fraction of samples $\{G'(z|c)\}$ such that $\mathrm{dist}(G'(z|c),G(z|\hat{c}))<\mathrm{dist}(G'(z|c),G(z|c))$, where $\mathrm{dist}$ is the $\ell_2$ distance in the Inception-V3 feature space (szegedy2016rethinking, ).

  2. A modified R-precision score $\mathcal{R}_r$ measures how well $G'(z|c)$ matches the target caption $\hat{c}$. (xu2018attngan, ) defined the correlation $\mathrm{corr}(x,c)$ between sample $x$ and caption $c$ as $\cos\langle\mathrm{Enc}_{\mathrm{CNN}}(x),\mathrm{Enc}_{\mathrm{RNN}}(c)\rangle$ for pretrained CNN (image) and RNN (text) encoders. We use the pretrained encoders from DM-GAN. Then, $\mathcal{R}_r$ is defined as the fraction of samples $G'(z|c)$ such that $\mathrm{corr}(G'(z|c),\hat{c})$ is larger than the correlation between $G'(z|c)$ and 100 random, mismatched captions.

  3. We further introduce $\mathcal{R}_{c/\hat{c}}$, which measures how much better $G'(z|c)$ matches $\hat{c}$ than $c$. It is defined as the fraction of samples $G'(z|c)$ such that $\mathrm{corr}(G'(z|c),\hat{c})>\mathrm{corr}(G'(z|c),c)$.
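The three redaction metrics above can be sketched as follows, operating on precomputed feature vectors. This is a minimal illustration under stated assumptions: the function names and the plain-list embedding format are hypothetical, the features are assumed to have been extracted beforehand (Inception-V3 features for the first metric, DM-GAN image and text encodings for the other two), and the R-precision variant is read as requiring the target caption to beat every mismatched caption.

```python
import math

def l2(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Cosine similarity, used here as corr(x, c)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def faithfulness(edited, ref_target, ref_original):
    """Fraction of samples G'(z|c) closer to G(z|c_hat) than to G(z|c)."""
    hits = sum(l2(e, t) < l2(e, o) for e, t, o in zip(edited, ref_target, ref_original))
    return hits / len(edited)

def modified_r_precision(img_emb, target_emb, mismatched_embs):
    """Fraction of samples whose target caption outscores all mismatched captions."""
    hits = 0
    for x, c_hat, others in zip(img_emb, target_emb, mismatched_embs):
        if all(cosine(x, c_hat) > cosine(x, m) for m in others):
            hits += 1
    return hits / len(img_emb)

def target_vs_original(img_emb, target_emb, original_emb):
    """Fraction of samples that match c_hat better than the redacted prompt c."""
    hits = sum(cosine(x, t) > cosine(x, o)
               for x, t, o in zip(img_emb, target_emb, original_emb))
    return hits / len(img_emb)
```

All three functions return a fraction in $[0,1]$, with higher values indicating that the edited model follows the target caption $\hat{c}$ more closely.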

Results. The results for redacting yellow and red are shown in Table III. The base configuration already achieves good redaction and generation quality. After improving capacity, all redaction quality metrics increase by $2.3\sim 2.7\%$, and generation quality is retained. After we fix the variance prediction in $H_1$, the redaction quality decreases by $\sim 1\%$, but the generation quality on valid prompts increases by $0.1$. Finally, with $\lambda$ annealing, all metrics improve. Notably, $\mathcal{R}_{G(\cdot|c/\hat{c})}$ and $\mathcal{R}_{c/\hat{c}}$ increase by over $5\%$, indicating that generated samples are more similar to $\hat{c}$ than to $c$.

We find that the Rewriting baselines achieve better IS. However, their generated samples are blurred and lack sharp edges, as shown in the visualization. The redaction quality of Rewriting lags significantly behind ours: all of its redaction metrics are less than half of ours. In particular, $\mathcal{R}_r$ is worse than that of the pre-trained model, indicating that samples conditioned on redacted prompts are not well correlated with $\hat{c}$. We hypothesize that the main problem for Rewriting is that it is crafted for 2D convolutions and edits the main generative network, which makes it hard to handle and distinguish the information from different prompts. Among the different resolutions, rewriting the layer at resolution $8\times 8$ yields the best redaction quality.

TABLE III: Generation and redaction quality after redacting yellow and red. Our method achieves significantly better redaction quality than Rewriting and retains good generation quality. The effects of each component within our method are displayed.
| Method | IS, redacted ($\uparrow$) | IS, valid ($\uparrow$) | $\mathcal{R}_{G(\cdot \mid c/\hat{c})}$ ($\uparrow$) | $\mathcal{R}_{c/\hat{c}}$ ($\uparrow$) | $\mathcal{R}_r$ ($\uparrow$) | Training time (mins) |
|---|---|---|---|---|---|---|
| Pre-trained | 4.62 | 5.22 | 0% | 6.0% | 13.5% | - |
| Rewriting $8\times 8$ | 5.57 | 5.52 | $\mathit{33.0}\%$ | $\mathit{39.7}\%$ | $\mathit{5.0}\%$ | 24.3 |
| Rewriting $16\times 16$ | 5.63 | 5.53 | 30.4% | 37.2% | 4.8% | 25.3 |
| Rewriting $32\times 32$ | 5.72 | 5.71 | 28.8% | 35.9% | 4.7% | 23.5 |
| Rewriting $64\times 64$ | $\mathbf{5.77}$ | $\mathbf{5.73}$ | 27.5% | 35.2% | 4.6% | 24.1 |
| Ours (base) | 4.79 | 5.23 | 65.1% | 77.0% | 46.9% | 27.4 |
| + improved capacity | 4.74 | 5.25 | 67.8% | 79.7% | $\mathbf{49.2}\%$ | 28.4 |
| + fix variance | 4.79 | 5.35 | 66.5% | 79.0% | 48.4% | 22.5 |
| + $\lambda$ annealing | 4.84 | 5.36 | $\mathbf{72.2}\%$ | $\mathbf{84.2}\%$ | $\mathbf{49.2}\%$ | 29.9 |

Table IV includes results for redacting the other prompts. The Rewriting baseline is applied at $8\times 8$ resolution in $H_1$ because it yields the best redaction quality. We find the base configuration of our method is already very effective. Our method greatly outperforms Rewriting in all redaction quality metrics and keeps good generation quality.

Visualization. See Appendix B-B for generated samples. The Rewriting baseline generates very blurry samples, while our method generates high-quality, sharp samples that also satisfy the redaction requirements.

Computation. Data redaction takes about 30 minutes to train on a single NVIDIA 3080 GPU.

TABLE IV: Generation and redaction quality after redacting various words or phrases. Our method achieves significantly better redaction quality than Rewriting and retains good generation quality.
| Redaction prompts | Method | IS, redacted ($\uparrow$) | IS, valid ($\uparrow$) | $\mathcal{R}_{G(\cdot \mid c/\hat{c})}$ ($\uparrow$) | $\mathcal{R}_{c/\hat{c}}$ ($\uparrow$) | $\mathcal{R}_r$ ($\uparrow$) | Training time (mins) |
|---|---|---|---|---|---|---|---|
| long beak, white belly | Pre-trained | 4.14 | 5.61 | 0% | 5.2% | 13.1% | - |
| | Rewriting | $\mathbf{5.36}$ | $\mathbf{5.85}$ | 32.6% | 51.4% | 5.6% | 23.0 |
| | Ours (base) | 4.91 | 5.81 | $\mathbf{70.5}\%$ | $\mathbf{83.6}\%$ | $\mathbf{50.1}\%$ | 28.3 |
| blue / red wings | Pre-trained | 3.97 | 5.48 | 0% | 4.1% | 13.1% | - |
| | Rewriting | $\mathbf{5.21}$ | $\mathbf{5.85}$ | 27.8% | 15.1% | 6.9% | 23.3 |
| | Ours (base) | 5.04 | 5.28 | $\mathbf{68.6}\%$ | $\mathbf{71.7}\%$ | $\mathbf{58.4}\%$ | 28.1 |
| blue | Pre-trained | 3.65 | 5.18 | 0% | 3.2% | 7.2% | - |
| | Rewriting | $\mathbf{5.00}$ | $\mathbf{5.45}$ | 61.8% | 60.2% | 17.7% | 28.7 |
| | Ours (base) | 3.85 | 5.21 | $\mathbf{81.3}\%$ | $\mathbf{89.7}\%$ | $\mathbf{66.2}\%$ | 34.7 |

Robustness to adversarial prompting. To understand whether adversarial prompts may cause the redacted model to generate content we would like to redact, we perform an adversarial prompting attack on the redacted or rewritten models in this section. Specifically, we apply the Square Attack (andriushchenko2020square, ; maus2023adversarial, ) directly to the discrete text space. For $c\in\mathcal{C}_{\Omega}$, the goal is to find an adversarial conditional $c_{\mathrm{adv}}$ such that $\mathrm{corr}(G'(z|c_{\mathrm{adv}}),c)>\mathrm{corr}(G'(z|c_{\mathrm{adv}}),\hat{c})$. The algorithm is given in Algorithm 1. See Appendix B-C for a few examples of successful attacks.

We measure the success rates of the proposed attack in Table V. The success rates for our redaction method are consistently lower than those for the Rewriting baseline (by $31\%\sim 45\%$), indicating that our method is considerably more robust to adversarial prompting attacks than Rewriting.

Algorithm 1 Adversarial Prompting via Square Attack (andriushchenko2020square, ; maus2023adversarial, )
1:  Initialize $c_{\mathrm{adv}}=c$.
2:  for iteration $=1,\cdots,16$ do
3:     Uniformly sample a position $s$ of the caption $c_{\mathrm{adv}}$ to update.
4:     Uniformly sample 32 candidate words from the token dictionary. Construct 32 candidate adversarial captions by replacing the $s$-th token of $c_{\mathrm{adv}}$ with these words, respectively.
5:     Update the adversarial caption $c_{\mathrm{adv}}$ with the candidate that has the largest $\mathrm{sim}(G'(z|c_{\mathrm{adv}}),c)$.
6:  end for
7:  return $c_{\mathrm{adv}}$
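
The steps of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: `score` is a hypothetical callable standing in for $\mathrm{sim}(G'(z|c_{\mathrm{adv}}),c)$, which in practice would run the edited generator and the encoders, and the caption is assumed to be a list of tokens.

```python
import random

def square_attack(caption, vocab, score, iterations=16, candidates=32, rng=None):
    """Random-search prompt attack following the steps of Algorithm 1.

    caption: list of tokens (the redacted prompt c).
    vocab:   list of tokens to sample replacements from.
    score:   hypothetical callable mapping a token list c_adv to a similarity
             with the redacted content; higher means a more successful attack.
    """
    rng = rng or random.Random()
    c_adv = list(caption)
    for _ in range(iterations):
        s = rng.randrange(len(c_adv))                     # position to update
        words = [rng.choice(vocab) for _ in range(candidates)]
        proposals = [c_adv[:s] + [w] + c_adv[s + 1:] for w in words]
        c_adv = max(proposals, key=score)                 # keep the best candidate
    return c_adv
```

As in the pseudocode, the current caption is always replaced by the best of the sampled candidates, so each iteration changes at most one token.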
TABLE V: Success rates of the adversarial prompting attack (Algorithm 1) to our redaction method and the Rewriting baseline. Our redaction method is more robust to such attacks than Rewriting.
| Redaction prompts | Method | Attack success rate ($\downarrow$) |
|---|---|---|
| long beak, white belly | Rewriting | 92.8% |
| | Ours (base) | $\mathbf{50.3}\%$ |
| blue / red wings | Rewriting | 97.4% |
| | Ours (base) | $\mathbf{65.7}\%$ |
| blue | Rewriting | 81.1% |
| | Ours (base) | $\mathbf{35.5}\%$ |
| yellow, red | Rewriting | 95.5% |
| | Ours (base) | $\mathbf{59.9}\%$ |

VI-C Redacting Diffusion-based Text-to-Speech Models

Setup. We use the pre-trained DiffWave model (kong2021diffwave, ) trained on the LJSpeech dataset (Ito2017ljspeech, ), which contains 13100 utterances from a female speaker reading books in a home environment. The model is conditioned on Mel-spectrograms. We redact unseen voices from the disjoint LibriTTS dataset (zen2019libritts, ). We randomly choose five voices to redact: speakers 125, 1578, 1737, and 1926 (female voices) and speaker 1040 (male voice). The training set for each voice has a total length of roughly 4 to 6 minutes.

Table VI includes the specific train-test splits of the LibriTTS voices. Note that for speaker 1040 there is only one chapter id, so we split based on the segment id shown in columns.

TABLE VI: Specific train-test splits of the LibriTTS voices, and their total lengths measured in minutes.
| Redaction voices | Training chapter id | Training total length | Test chapter id | Test total length |
|---|---|---|---|---|
| speaker 125 | 121124 | 5.89 | 121342 | 2.30 |
| speaker 1578 | 140045, 140049 | 4.81 | 6379 | 1.30 |
| speaker 1737 | 142397, 148989, 142396 | 3.75 | 146161 | 2.51 |
| speaker 1926 | 147979, 147987 | 5.44 | 143879 | 1.98 |
| speaker 1040 | 133433 (0-98) | 4.65 | 133433 (100-168) | 2.35 |

CycleGAN-VC2, Whisper, and Tortoise-TTS details. We train CycleGAN-VC2 (kaneko2019cyclegan, ) with the following code (https://github.com/jackaduma/CycleGAN-VC2). The training data for CycleGAN-VC2 consists of the training data of a LibriTTS voice and the first 100 samples of LJ003 from LJSpeech (this equals $\sim 1\%$ of the training utterances from LJSpeech, or $\sim 11$ minutes). We train CycleGAN-VC2 for 1000 iterations with a batch size of 8. We use the medium-sized English-only Whisper model (https://github.com/openai/whisper) and the Tortoise-TTS model (https://github.com/neonbjb/tortoise-tts). To sample from Tortoise-TTS, we use two 10-second utterances from LJSpeech as the reference voice.

Architecture and optimization. The architecture of the pre-trained model and other details are in Appendix C. The architecture of student conditioning networks with improved capacity is shown in Fig. 28. We use the Adam optimizer with a learning rate of $0.001$ to optimize the $\ell_1$ loss. The redaction algorithm terminates at 80000 iterations. We use a batch size of 32. Our distillation algorithm is trained with the spectrogram data only.

Configurations. We first use the uniform parallel distillation loss with $\lambda=1.5$. We fix all up-sampling layers and call this the base configuration. We then use the spectrogram-rewriting module to improve capacity. Next, we improve voice cloning with Whisper and Tortoise-TTS when training CycleGAN-VC2. Finally, we investigate the non-uniform distillation losses in Table I, where we set $\alpha=0.001$ and $\beta=0.01$ so that all $w_i$'s or $\lambda_i$'s have the same order of magnitude.

Evaluation metrics. To evaluate generation quality on the training voice $\mathcal{C}\setminus\mathcal{C}_{\Omega}$, we compute the following two speech quality metrics on the test set of LJSpeech: Perceptual Evaluation of Speech Quality (PESQ) (PESQ2001, ) and Short-Time Objective Intelligibility (STOI) (taal2011algorithm, ). To evaluate redaction quality, we train a speaker classifier to distinguish redacted from training voices in each experiment. We extract Mel-frequency cepstral coefficients (xu2005hmm, ), spectral contrast (jiang2002music, ), and chroma features (ellis2007chroma, ) as sample features and train a support vector classifier. We then compute the recall rate on redacted voices after we perform redaction. In contrast to standard classification, a lower recall rate means that a higher fraction of redacted voices are projected to the training voice by the edited model, which indicates better redaction quality. See Appendix C-B for details of these metrics.
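To make the direction of the recall metric concrete, the sketch below computes it from the speaker classifier's predictions on edited-model outputs for redacted conditionals. The function name and the string labels are hypothetical; the classifier itself (features plus support vector classifier) is assumed to have been trained separately.

```python
def redacted_recall(predicted_labels, redacted_label="redacted"):
    """Fraction of edited-model outputs for redacted conditionals that the
    speaker classifier still assigns to the redacted voice.

    Lower values mean more outputs were projected to the training voice,
    i.e. better redaction quality.
    """
    if not predicted_labels:
        raise ValueError("need at least one prediction")
    hits = sum(label == redacted_label for label in predicted_labels)
    return hits / len(predicted_labels)
```

For example, if the classifier labels one of four generated utterances as the redacted speaker and the rest as the training speaker, the recall is 0.25.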

TABLE VII: Results of generation and redaction quality for redacting the male speaker 1040 in LibriTTS. The $\lambda_i$-order schedule in the non-uniform distillation losses leads to the best overall performance. The effects of each component within our method are displayed.
| Method | PESQ, LJSpeech ($\uparrow$) | STOI, LJSpeech ($\uparrow$) | Recall ($\mathcal{C}_{\Omega}$) ($\downarrow$) |
|---|---|---|---|
| Pre-trained | 3.33 | 97.8% | - |
| base | 2.85 | 95.7% | 52% |
| + improved capacity | 3.03 | 96.6% | 35% |
| + improved voice cloning | 3.02 | 96.6% | 35% |
| + non-uniform, $\lambda_i$-order | $\mathbf{3.23}$ | $\mathbf{97.4}\%$ | 40% |
| + non-uniform, $\lambda_i$-dilation | 3.21 | $\mathbf{97.4}\%$ | 50% |
| + non-uniform, $w_i$-order | 3.02 | 96.6% | $\mathbf{29}\%$ |
| + non-uniform, $w_i$-dilation | 3.02 | 96.6% | 30% |
TABLE VIII: Results of generation and redaction quality for redacting several female speakers in LibriTTS. The improved capacity configuration leads to the best overall performance in most settings, with an exception for speaker 1926 where both configurations lead to similar performance.
| Redaction voices | Method | PESQ, LJSpeech ($\uparrow$) | STOI, LJSpeech ($\uparrow$) | Recall ($\mathcal{C}_{\Omega}$) ($\downarrow$) |
|---|---|---|---|---|
| - | Pre-trained | 3.33 | 97.8% | - |
| speaker 125 | base | 3.14 | 97.0% | $\mathbf{0}\%$ |
| | + improved capacity | $\mathbf{3.27}$ | $\mathbf{97.4}\%$ | 3% |
| speaker 1578 | base | 2.14 | 94.4% | $\mathbf{1}\%$ |
| | + improved capacity | $\mathbf{3.24}$ | $\mathbf{97.4}\%$ | 3% |
| speaker 1737 | base | 2.49 | 94.9% | $\mathbf{4}\%$ |
| | + improved capacity | $\mathbf{3.24}$ | $\mathbf{97.2}\%$ | 9% |
| speaker 1926 | base | $\mathbf{3.06}$ | 96.3% | $\mathbf{16}\%$ |
| | + improved capacity | 3.04 | $\mathbf{96.6}\%$ | $\mathbf{16}\%$ |

Results. The results for redacting speaker 1040 are shown in Table VII. With the base configuration, we can redact a fraction of conditionals, but the generation quality is much worse than that of the pre-trained model. Improving capacity improves both generation and redaction quality. Improved voice cloning does not increase the quantitative metrics, but we find the generation quality is perceptually slightly better. The non-uniform distillation losses have a large impact on the results. The $\lambda_i$-order and $\lambda_i$-dilation schedules boost generation quality by a large margin without compromising redaction quality too much. The $w_i$-order and $w_i$-dilation schedules improve redaction quality while keeping the generation quality. As high generation quality is very important for speech synthesis (on non-redacted voices), the $\lambda_i$-order schedule leads to the best overall performance.

The results for redacting other speakers are shown in Table VIII. In most settings, the improved capacity configuration leads to much better generation quality than the base configuration with very little loss in redaction quality, except for speaker 1926, where the results are similar.

Computation. On a single NVIDIA 3080 GPU, it takes less than 60 minutes to distill with the base configuration, and around 100 minutes with the other configurations. It takes around 2 hours to train the CycleGAN-VC2 model. As a comparison, DiffWave takes days to train on 8 GPUs.

Demo. We include audio samples in our demo website: https://dataredact2023.github.io/.

VII Conclusion and Discussion

In this paper, we introduce a formal statistical machine learning framework for redacting data from conditional generative models, and present a computationally efficient method that only involves the conditioning networks. We introduce explicit formulas for simple models, and propose distillation-based methods for practical conditional models. Empirically, our method performs well for practical text-to-image/speech models. It is computationally efficient, and can effectively redact certain conditionals while retaining high generation quality. For redacting prompts in text-to-image models, our method redacts better and is considerably more robust than the baseline methods. For redacting voices in text-to-speech models, our method can redact both similar and different voices while retaining high speech quality and intelligibility.

In the following we include discussion on guaranteed safety, adversarial robustness, limitations of our method, and future work.

Guaranteed Safety and Fine-tuning

We first note that complete redaction to zero probability mass may be impossible for generative models with infinite support (as most deep generative models are). Take the unconditional normalizing flow as an example. We have the following proposition:

Proposition 1.

Let an invertible and smooth function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ be an unconditional normalizing flow on the data space $\mathbb{R}^{d}$ that converts a standard Gaussian $\mathcal{N}$ to the output distribution $F_{\#}\mathcal{N}$. For any set $\mathcal{X}$ that has non-zero measure on $\mathbb{R}^{d}$, the probability mass on $\mathcal{X}$, $(F_{\#}\mathcal{N})(\mathcal{X})$, is positive.

Proof.
$$
\begin{aligned}
(F_{\#}\mathcal{N})(\mathcal{X}) &= \int_{x\in\mathcal{X}}(F_{\#}\mathcal{N})(x)\,dx \\
&= \int_{z\in F^{-1}(\mathcal{X})}\mathcal{N}(z)\,dz \\
&= \mathcal{N}(F^{-1}(\mathcal{X})).
\end{aligned}
$$

Because $\mathcal{X}$ has positive measure and $F$ is invertible and smooth, $F^{-1}(\mathcal{X})$ has positive measure. Because $\mathcal{N}$ is positive, $\mathcal{N}(F^{-1}(\mathcal{X}))>0$, and therefore $(F_{\#}\mathcal{N})(\mathcal{X})>0$. ∎
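
As a one-dimensional illustration of the proposition (an example added here under the assumption of an affine flow, not taken from the original analysis), let $d=1$, $F(z)=\sigma z+\mu$ with $\sigma>0$, and $\mathcal{X}=[a,b]$ with $a<b$. Writing $\Phi$ for the standard Gaussian CDF,

$$
(F_{\#}\mathcal{N})(\mathcal{X}) = \mathcal{N}\big(F^{-1}([a,b])\big) = \Phi\!\left(\frac{b-\mu}{\sigma}\right)-\Phi\!\left(\frac{a-\mu}{\sigma}\right) > 0,
$$

since $\Phi$ is strictly increasing. The parameters $\mu$ and $\sigma$ can make this mass arbitrarily small, but no choice makes it exactly zero.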

Furthermore, (pham2023circumventing, ) discovered that fine-tuning on a set of inappropriate samples can break many mitigation methods for text-to-image models. We believe there is an impossibility result: if the adversary has access to a dataset of inappropriate samples and fine-tunes on it, there is nothing a learner can do to prevent this. In the previous normalizing flow example, the adversary can optimize the following objective:

$$
\begin{array}{l}
\arg\max_{F}\ \mathbb{E}_{x\sim\mathcal{X}}\log[(F_{\#}\mathcal{N})(x)]\\
=\arg\max_{F}\ \mathbb{E}_{z\sim F^{-1}(\mathcal{X})}\log[\mathcal{N}(z)/|\mathrm{det}\nabla_{z}F(z)|]\\
=\arg\min_{F}\ \mathbb{E}_{z\sim F^{-1}(\mathcal{X})}\left(\|z\|_{2}^{2}/2+\log|\mathrm{det}\nabla_{z}F(z)|\right)
\end{array}
$$

and, as a result, place more probability mass on $\mathcal{X}$.
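
The effect of such fine-tuning can be illustrated numerically in one dimension. The sketch below assumes an affine flow $F(z)=\sigma z+\mu$, for which maximizing the log-likelihood has a closed form (the sample mean and population standard deviation); the interval, the samples, and the function names are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist, fmean, pstdev

def mass_on_interval(mu, sigma, a, b):
    """Probability mass on [a, b] under the affine flow F(z) = sigma * z + mu."""
    out = NormalDist(mu, sigma)
    return out.cdf(b) - out.cdf(a)

def fine_tune_affine_flow(samples):
    """Maximum-likelihood (mu, sigma) for F(z) = sigma * z + mu on the samples."""
    return fmean(samples), pstdev(samples)

# Hypothetical redacted region X = [4, 5] and adversary-held samples from it
a, b = 4.0, 5.0
samples = [4.1, 4.3, 4.5, 4.6, 4.8, 4.9]

before = mass_on_interval(0.0, 1.0, a, b)   # identity flow: output is N(0, 1)
mu, sigma = fine_tune_affine_flow(samples)  # adversary's fine-tuned flow
after = mass_on_interval(mu, sigma, a, b)
```

In this toy setting the mass on the redacted interval starts out tiny but positive, and after fitting to the adversary's samples most of the output distribution lies inside it.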

Despite this impossibility result, our method substantially raises the barrier for users to generate undesirable content. For example, if the adversary does not have enough data of a certain celebrity's voice, they are not able to reverse-engineer the network by fine-tuning on those data. In practice, it is usually necessary to combine different security mechanisms to ensure the safety of generation.

Adversarial Robustness

We have shown that, in the text-to-image experiments, our method is less susceptible to an existing adversarial prompting attack than the baseline method. However, we would like to note that a formal definition of adversarial robustness in conditional generative models (e.g. text-to-X) is a largely open problem. We think many different threat models can be defined depending on the setting and the adversary's goal, capabilities, and knowledge of the model, which is outside the scope of this paper.

Limitations

There are certain types of neural networks that our method cannot be easily adapted to, especially when the conditional network is not completely independent from the main generative network. Examples include StyleGAN (karras2020analyzing, ), Transformer-based architectures with complex cross-attention layers, or multi-modal networks that mix input tokens from different modalities at the beginning.

Future work

One important future direction is to further improve robustness against adversarial attacks. Another line of future work is to apply the proposed method to Transformer-based architectures, where the conditioning networks are based on cross-attention blocks. A third direction is to extend our method to the online setting where redacted samples come in a stream. To achieve this, we need to modify the loss in (4) by using a weighted sampling strategy that assigns higher probability to newly seen samples.
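
One possible reading of the weighted sampling strategy mentioned above is an exponential recency weighting over the stream of redacted samples. The sketch below is purely illustrative: the decay factor, the function names, and the choice of exponential weights are assumptions, not a design proposed in this paper.

```python
import random

def recency_weights(n, decay=0.9):
    """Weights decay**age for n samples ordered oldest to newest (newest has age 0)."""
    return [decay ** (n - 1 - i) for i in range(n)]

def sample_batch(stream, batch_size, decay=0.9, rng=None):
    """Draw a batch with replacement, favoring recently arrived samples."""
    rng = rng or random.Random()
    weights = recency_weights(len(stream), decay)
    return rng.choices(stream, weights=weights, k=batch_size)
```

A smaller `decay` concentrates training on the newest redaction requests, while `decay=1.0` recovers uniform sampling over everything seen so far.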

Acknowledgements

This work was supported by NSF under CNS 1804829 and ARO MURI W911NF2110317.

References

  • (1) R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” 2021.
  • (2) A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8821–8831.
  • (3) A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
  • (4) A. Sauer, T. Karras, S. Laine, A. Geiger, and T. Aila, “Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis,” arXiv preprint arXiv:2301.09515, 2023.
  • (5) Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=a-xFK8Ymz5J
  • (6) S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” in International Conference on Learning Representations, 2023.
  • (7) OpenAI, “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • (8) H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • (9) A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
  • (10) A. Birhane, V. U. Prabhu, and E. Kahembwe, “Multimodal datasets: misogyny, pornography, and malignant stereotypes,” arXiv preprint arXiv:2110.01963, 2021.
  • (11) C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” arXiv preprint arXiv:2210.08402, 2022.
  • (12) J. Rando, D. Paleka, D. Lindner, L. Heim, and F. Tramèr, “Red-teaming the stable diffusion safety filter,” arXiv preprint arXiv:2210.04610, 2022.
  • (13) P. Bedapudi. (2022) Nudenet: Neural nets for nudity detection and censoring. [Online]. Available: https://github.com/notAI-tech/NudeNet
  • (14) G. Laborde, “Deep nn for nsfw detection,” 2022. [Online]. Available: https://github.com/GantMan/nsfw_model
  • (15) J. Betker. (2022, 4) TorToiSe text-to-speech. [Online]. Available: https://github.com/neonbjb/tortoise-tts
  • (16) C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  • (17) Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
  • (18) G. K. Pitsilis, H. Ramampiaro, and H. Langseth, “Detecting offensive language in tweets using deep learning,” arXiv preprint arXiv:1801.04433, 2018.
  • (19) E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Universal adversarial triggers for attacking and analyzing nlp,” arXiv preprint arXiv:1908.07125, 2019.
  • (20) K. McGuffie and A. Newhouse, “The radicalization risks of gpt-3 and advanced neural language models,” arXiv preprint arXiv:2009.06807, 2020.
  • (21) S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” arXiv preprint arXiv:2009.11462, 2020.
  • (22) A. Abid, M. Farooqi, and J. Zou, “Persistent anti-muslim bias in large language models,” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021, pp. 298–306.
  • (23) E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,” arXiv preprint arXiv:2202.03286, 2022.
  • (24) P. Schramowski, C. Turan, N. Andersen, C. A. Rothkopf, and K. Kersting, “Large pre-trained language models contain human-like biases of what is right and wrong to do,” Nature Machine Intelligence, vol. 4, no. 3, pp. 258–268, 2022.
  • (25) P. Schramowski, M. Brack, B. Deiseroth, and K. Kersting, “Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models,” arXiv preprint arXiv:2211.05105, 2022.
  • (26) Z. Kong and K. Chaudhuri, “Data redaction from pre-trained gans,” in First IEEE Conference on Secure and Trustworthy Machine Learning, 2023.
  • (27) M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5802–5810.
  • (28) D. Bau, S. Liu, T. Wang, J.-Y. Zhu, and A. Torralba, “Rewriting a deep generative model,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 351–369.
  • (29) Y. Cao and J. Yang, “Towards making systems forget with machine unlearning,” in 2015 IEEE Symposium on Security and Privacy. IEEE, 2015, pp. 463–480.
  • (30) C. Guo, T. Goldstein, A. Hannun, and L. Van Der Maaten, “Certified data removal from machine learning models,” arXiv preprint arXiv:1911.03030, 2019.
  • (31) S. Schelter, “Amnesia - machine learning models that can forget user data very fast.” in CIDR, 2020.
  • (32) S. Neel, A. Roth, and S. Sharifi-Malvajerdi, “Descent-to-delete: Gradient-based methods for machine unlearning,” in Algorithmic Learning Theory.   PMLR, 2021, pp. 931–962.
  • (33) A. Sekhari, J. Acharya, G. Kamath, and A. T. Suresh, “Remember what you want to forget: Algorithms for machine unlearning,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • (34) Z. Izzo, M. A. Smart, K. Chaudhuri, and J. Zou, “Approximate data deletion from machine learning models,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 2008–2016.
  • (35) E. Ullah, T. Mai, A. Rao, R. A. Rossi, and R. Arora, “Machine unlearning via algorithmic stability,” in Conference on Learning Theory.   PMLR, 2021, pp. 4126–4142.
  • (36) L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot, “Machine unlearning,” in 2021 IEEE Symposium on Security and Privacy (SP).   IEEE, 2021, pp. 141–159.
  • (37) A. Warnecke, L. Pirch, C. Wressnegger, and K. Rieck, “Machine unlearning of features and labels,” arXiv preprint arXiv:2108.11577, 2021.
  • (38) Z. Kong and S. Alfeld, “Approximate data deletion in generative models,” arXiv preprint arXiv:2206.14439, 2022.
  • (39) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • (40) D. Bau, H. Strobelt, W. Peebles, J. Wulff, B. Zhou, J.-Y. Zhu, and A. Torralba, “Semantic photo manipulation with a generative image prior,” arXiv preprint arXiv:2005.07727, 2020.
  • (41) J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • (42) O. Bar-Tal, D. Ofri-Amar, R. Fridman, Y. Kasten, and T. Dekel, “Text2live: Text-driven layered image and video editing,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV.   Springer, 2022, pp. 707–723.
  • (43) A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross attention control,” arXiv preprint arXiv:2208.01626, 2022.
  • (44) B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” arXiv preprint arXiv:2210.09276, 2022.
  • (45) D. Valevski, M. Kalman, Y. Matias, and Y. Leviathan, “Unitune: Text-driven image editing by fine tuning an image generation model on a single image,” arXiv preprint arXiv:2210.09477, 2022.
  • (46) M. Brack, P. Schramowski, F. Friedrich, D. Hintersdorf, and K. Kersting, “The stable artist: Steering semantics in diffusion latent space,” arXiv preprint arXiv:2212.06013, 2022.
  • (47) S. Asokan and C. Seelamantula, “Teaching a gan what not to learn,” Advances in Neural Information Processing Systems, vol. 33, pp. 3964–3975, 2020.
  • (48) A. Sinha, K. Ayush, J. Song, B. Uzkent, H. Jin, and S. Ermon, “Negative data augmentation,” in International Conference on Learning Representations, 2021.
  • (49) A. Cherepkov, A. Voynov, and A. Babenko, “Navigating the gan parameter space for semantic image editing,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3671–3680.
  • (50) S. Malnick, S. Avidan, and O. Fried, “Taming a generative model,” arXiv preprint arXiv:2211.16488, 2022.
  • (51) S. Moon, S. Cho, and D. Kim, “Feature unlearning for generative models via implicit feedback,” arXiv preprint arXiv:2303.05699, 2023.
  • (52) R. Gandikota, J. Materzynska, J. Fiotto-Kaufman, and D. Bau, “Erasing concepts from diffusion models,” arXiv preprint arXiv:2303.07345, 2023.
  • (53) R. Gandikota, H. Orgad, Y. Belinkov, J. Materzyńska, and D. Bau, “Unified concept editing in diffusion models,” arXiv preprint arXiv:2308.14761, 2023.
  • (54) E. Zhang, K. Wang, X. Xu, Z. Wang, and H. Shi, “Forget-me-not: Learning to forget in text-to-image diffusion models,” arXiv preprint arXiv:2303.17591, 2023.
  • (55) A. Heng and H. Soh, “Selective amnesia: A continual learning approach to forgetting in deep generative models,” arXiv preprint arXiv:2305.10120, 2023.
  • (56) N. Kumari, B. Zhang, S.-Y. Wang, E. Shechtman, R. Zhang, and J.-Y. Zhu, “Ablating concepts in text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 691–22 702.
  • (57) J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  • (58) M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
  • (59) P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-ucsd birds 200,” Caltech, Tech. Rep. CNS-TR-201, 2010. [Online]. Available: http://www.vision.caltech.edu/visipedia/CUB-200.html
  • (60) S. Reed, Z. Akata, H. Lee, and B. Schiele, “Learning deep representations of fine-grained visual descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 49–58.
  • (61) T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1316–1324.
  • (62) T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “Cyclegan-vc2: Improved cyclegan-based non-parallel voice conversion,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 6820–6824.
  • (63) A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” arXiv preprint arXiv:2212.04356, 2022.
  • (64) Y. LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
  • (65) D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • (66) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, 2016.
  • (67) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
  • (68) M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: a query-efficient black-box adversarial attack via random search,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII.   Springer, 2020, pp. 484–501.
  • (69) N. Maus, P. Chao, E. Wong, and J. Gardner, “Adversarial prompting for black box foundation models,” arXiv preprint arXiv:2302.04237, 2023.
  • (70) K. Ito, “The LJ speech dataset,” 2017.
  • (71) H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019.
  • (72) ITU-T Recommendation, “Perceptual evaluation of speech quality (pesq): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P.862, 2001.
  • (73) C. H. Taal et al., “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, 2011.
  • (74) M. Xu, L.-Y. Duan, J. Cai, L.-T. Chia, C. Xu, and Q. Tian, “Hmm-based audio keyword generation,” in Advances in Multimedia Information Processing-PCM 2004: 5th Pacific Rim Conference on Multimedia, Tokyo, Japan, November 30-December 3, 2004. Proceedings, Part III 5.   Springer, 2005, pp. 566–574.
  • (75) D.-N. Jiang, L. Lu, H.-J. Zhang, J.-H. Tao, and L.-H. Cai, “Music type classification by spectral contrast feature,” in Proceedings. IEEE International Conference on Multimedia and Expo, vol. 1.   IEEE, 2002, pp. 113–116.
  • (76) D. Ellis, “Chroma feature analysis and synthesis,” Resources of laboratory for the recognition and organization of speech and audio-LabROSA, vol. 5, 2007.
  • (77) M. Pham, K. O. Marshall, and C. Hegde, “Circumventing concept erasure methods for text-to-image generative models,” arXiv preprint arXiv:2308.01508, 2023.
  • (78) T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.

Appendix A Recovering Original Score in (gandikota2023erasing, )

Let $\epsilon_{\theta^{*}}(x_{t},c,t)$ be the original score and

$$\epsilon_{\theta}(x_{t},c,t)=\epsilon_{\theta^{*}}(x_{t},t)-\eta\left(\epsilon_{\theta^{*}}(x_{t},c,t)-\epsilon_{\theta^{*}}(x_{t},t)\right)$$

be the distilled score. We could recover the original score from the distilled score in the following way. First, by letting $c=\emptyset$, we have

$$\epsilon_{\theta}(x_{t},t)=\epsilon_{\theta^{*}}(x_{t},t).$$

Inserting this to the right-hand-side of the definition of distilled score, one could get

$$\epsilon_{\theta}(x_{t},c,t)=\epsilon_{\theta}(x_{t},t)-\eta\left(\epsilon_{\theta^{*}}(x_{t},c,t)-\epsilon_{\theta}(x_{t},t)\right).$$

As a result, one can recover the original score as

$$\epsilon_{\theta^{*}}(x_{t},c,t)=\frac{1}{\eta}\left((1+\eta)\,\epsilon_{\theta}(x_{t},t)-\epsilon_{\theta}(x_{t},c,t)\right).$$

By using the original score for sampling, one may be able to generate concepts that have been erased.
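The inversion can be checked numerically. The following is a minimal sketch that assumes the distillation is exact, so the erased model reproduces the distilled score perfectly. Random vectors stand in for the network outputs, and the value of $\eta$ and all function names are illustrative only.

```python
import random

def distilled_score(eps_star_cond, eps_star_uncond, eta):
    """Distilled score: eps*(x_t, t) - eta * (eps*(x_t, c, t) - eps*(x_t, t))."""
    return [u - eta * (c - u) for c, u in zip(eps_star_cond, eps_star_uncond)]

def recover_original_score(eps_cond, eps_uncond, eta):
    """Invert the definition: (1 / eta) * ((1 + eta) * eps(x_t, t) - eps(x_t, c, t))."""
    return [((1.0 + eta) * u - c) / eta for c, u in zip(eps_cond, eps_uncond)]

rng = random.Random(0)
eta = 1.0  # illustrative value; the inversion needs eta != 0
eps_star_cond = [rng.gauss(0.0, 1.0) for _ in range(8)]    # stand-in for eps_{theta*}(x_t, c, t)
eps_star_uncond = [rng.gauss(0.0, 1.0) for _ in range(8)]  # stand-in for eps_{theta*}(x_t, t)

# What the erased model exposes: its conditional and unconditional outputs.
eps_cond = distilled_score(eps_star_cond, eps_star_uncond, eta)
eps_uncond = eps_star_uncond  # equal to eps_{theta*}(x_t, t) by the c = empty-set step

recovered = recover_original_score(eps_cond, eps_uncond, eta)
max_error = max(abs(r - o) for r, o in zip(recovered, eps_star_cond))
```

Under these assumptions `max_error` is zero up to floating-point rounding. With an imperfectly distilled model the recovery would only be approximate.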

Appendix B Additional Details and Experiments for Redaction from DM-GAN

B-A Details of the Pre-trained Model and the Proposed Student Networks

The high-level architecture of DM-GAN is shown in Fig. 3 and 4. The first conditioning network $H_{1}$ takes the sentence embedding $v_{s}(c)$ as input and outputs two vectors: a mean vector, and the square root of the variance vector. A re-parameterization similar to variational auto-encoders is applied to these two vectors, and the output is concatenated to the latent code. The other two conditioning networks $H_{2}$ and $H_{3}$, called the memory writing module, take two inputs: the word embeddings $v_{w}(c)$, and the image features of the previously generated low resolution images. The output of $H_{2}$ or $H_{3}$ then goes through the rest of the modules in the main generative network. We use the pre-trained model and code from https://github.com/MinfengZhu/DM-GAN under MIT license. The pre-trained model takes days to train on 1 or more GPUs.
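The re-parameterization step applied to the output of the first conditioning network can be sketched as follows. This is not the DM-GAN code: the dimensions, the function names, and the concatenation order (latent code first) are assumptions made for the demo.

```python
import random

def reparameterize(mean, std, rng):
    """Sample mean + std * noise with noise ~ N(0, I), as in variational auto-encoders."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def condition_latent(latent_code, mean, std, rng):
    """Append the re-parameterized conditioning vector to the latent code."""
    return list(latent_code) + reparameterize(mean, std, rng)

rng = random.Random(0)
cond_dim, latent_dim = 4, 6  # hypothetical sizes, chosen only for the demo
mean = [0.5] * cond_dim      # stand-in for the mean vector output by H_1
std = [0.1] * cond_dim       # stand-in for the square root of the variance vector
latent_code = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

generator_input = condition_latent(latent_code, mean, std, rng)
```

Because the noise is drawn independently of the two output vectors, the sample remains a differentiable function of them, which is what makes this conditioning network trainable by back-propagation.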

Figure 3: High-level architecture of DM-GAN.
Figure 4: High-level architecture of original and higher-capacity conditioning networks of DM-GAN.

B-B Visualization

In Figs. 5-8, we visualize examples where we redact prompts that contain “long beak” or “white belly”. In Figs. 9-12, we visualize examples where we redact prompts that contain “blue wings” or “red wings”. In Figs. 13-16, we visualize examples where we redact prompts that contain “blue”. In Figs. 17-20, we visualize examples where we redact prompts that contain “yellow” or “red”.

Panels in Figures 5 to 20: (a) Pre-trained $G(\cdot|c)$; (b) Reference $G(\cdot|\hat{c})$; (c) Our Redaction $G^{\prime}(\cdot|c)$; (d) Rewriting Baseline.

Figure 5: Redacted prompt: “this particular bird has a white belly and breasts and black head and back”. Reference prompt: “this particular bird has a black belly and breasts and black head and back”.
Figure 6: Redacted prompt: “this bird has feathers that are black and has a white belly”. Reference prompt: “this bird has feathers that are black and has a black belly”.
Figure 7: Redacted prompt: “a small bird with an orange throat and long beak”. Reference prompt: “a small bird with an orange throat and short beak”.
Figure 8: Redacted prompt: “the black and white bird has a sharp long beak”. Reference prompt: “the black and white bird has a sharp short beak”.
Figure 9: Redacted prompt: “this bird has wings that are blue and has black feet”. Reference prompt: “this bird has wings that are white and has black feet”.
Figure 10: Redacted prompt: “this is a grey bird with blue wings and a pointy beak”. Reference prompt: “this is a grey bird with white wings and a pointy beak”.
Figure 11: Redacted prompt: “this bird has wings that are red and has a white belly”. Reference prompt: “this bird has wings that are white and has a white belly”.
Figure 12: Redacted prompt: “this bird has wings that are red and has a yellow belly”. Reference prompt: “this bird has wings that are white and has a yellow belly”.
Figure 13: Redacted prompt: “this bird has wings that are blue and has black feet”. Reference prompt: “this bird has wings that are red and has black feet”.
Figure 14: Redacted prompt: “this bird has small wings and blue grey nape”. Reference prompt: “this bird has small wings and red grey nape”.
Figure 15: Redacted prompt: “the bird is blue with gray wins and tail”. Reference prompt: “the bird is red with gray wins and tail”.
Figure 16: Redacted prompt: “this bird has wings that are blue and has a white belly”. Reference prompt: “this bird has wings that are red and has a white belly”.
Figure 17: Redacted prompt: “this is a red bird with a white belly and a large beak”. Reference prompt: “this is a black bird with a white belly and a large beak”.
Figure 18: Redacted prompt: “a bird with thick short beak red crown red breast that fades into a pink and white belly and red coverts”. Reference prompt: “a bird with thick short beak black crown black breast that fades into a pink and white belly and black coverts”.
Figure 19: Redacted prompt: “this yellow breasted bird has a dark gray head and chest a thin beak and a long tail”. Reference prompt: “this black breasted bird has a dark gray head and chest a thin beak and a long tail”.
Figure 20: Redacted prompt: “the yellow and black bird has a white belly”. Reference prompt: “the black and black bird has a white belly”.

B-C Adversarial Prompting Attack

Panels in Figures 21 to 26: (a) Pre-trained $G(\cdot|c)$; (b) Our Redaction $G^{\prime}(\cdot|c)$; (c) Adversarial $G^{\prime}(\cdot|c_{\mathrm{adv}})$.

Figure 21: Redacted prompt: “a short yellow beak on an otherwise black bird bright irises around black pupils”. Adversarial prompt: “a short yellow wingtip on yhellow otherwise wa bird bright irises around dramatic pupils”.
Figure 22: Redacted prompt: “this is a colorful bird boasting a coral red color strongest at head and fading towards tail”. Adversarial prompt: “wingtip is a colorful bird boasting the coral red color strongest at primary and fading towards tail”.
Figure 23: Redacted prompt: “this small bird has a light yellow belly and a dark yellow head with black wings and tail”. Adversarial prompt: “this small bird the the light yellow belly and a dark yellow head with black remaining and mahogany”.
Figure 24: Redacted prompt: “this bird is golden yellow with a black head”. Adversarial prompt: “through mahogany wingtip golden yellow bed a black head”.
Figure 25: Redacted prompt: “this bird has a golden yellow body a black head a pink bill that curves downward”. Adversarial prompt: “mahogany the has blacker golden yellow basically a black head a primary bill that through hughed”.
Figure 26: Redacted prompt: “this bird has a black beak a black crown and a belly that is golden yellow”. Adversarial prompt: “mahogany love has a riangular love bed black crown and a dramatic hjas breadt golden yellow”.

Appendix C Additional Details and Experiments for Redaction from DiffWave

C-A Details of the Pre-trained Model and the Proposed Student Networks

The high-level architecture of DiffWave is shown in Fig. 27. We select the base (64 channels) version of the model. The model is conditioned on an 80-band Mel-spectrogram with FFT size $=1024$, hop size $=256$, and window size $=1024$. Each conditioning network has two up-sampling layers that up-sample the spectrogram, and a one-dimensional convolution layer that maps the number of channels to 128. We use the pre-trained model and code from https://github.com/philsyn/DiffWave-Vocoder under the MIT license; the model is trained on all LJSpeech samples except for LJ001 and LJ002, which are used as the test set. The pre-trained model takes days to train on 8 GPUs.
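
To make the spectrogram settings above concrete, the following sketch derives a few quantities from them. It is an illustration only: the 22,050 Hz sampling rate (the native rate of LJSpeech) is not stated in this appendix and is assumed here, as is the padding convention in the hypothetical helper `spectrogram_shape`.

```python
import math

# Values stated in the text.
N_MELS, N_FFT, HOP_SIZE, WIN_SIZE = 80, 1024, 256, 1024
# Assumption: 22,050 Hz, the native LJSpeech sampling rate (not stated here).
SAMPLE_RATE = 22050


def spectrogram_shape(num_samples: int) -> tuple:
    """Return (mel bands, frames), assuming the waveform is padded so that
    every hop of HOP_SIZE samples yields exactly one frame."""
    return N_MELS, math.ceil(num_samples / HOP_SIZE)


freq_bins = N_FFT // 2 + 1               # one-sided FFT bins before the Mel filterbank
hop_ms = 1000 * HOP_SIZE / SAMPLE_RATE   # time step between consecutive frames
win_ms = 1000 * WIN_SIZE / SAMPLE_RATE   # duration covered by one analysis window
overlap = 1 - HOP_SIZE / WIN_SIZE        # fraction shared by adjacent windows

print(freq_bins, round(hop_ms, 2), round(win_ms, 2), overlap)
print(spectrogram_shape(SAMPLE_RATE))    # shape for one second of audio
```

Because each frame corresponds to 256 waveform samples, the two up-sampling layers together must stretch the spectrogram by a factor of 256 along time for it to align sample by sample with the audio.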

Figure 27: High-level architecture of DiffWave.

For the additional layers in the improved capacity configuration, all convolutions are one-dimensional with kernel size $=1$. $h_{\mathrm{trans}}$ includes two convolutions that keep the number of channels ($=80$) and a leaky ReLU activation with negative slope $=0.4$ between them. $h_{\mathrm{gate}}$ includes one zero-initialized convolution that changes the number of channels from 80 to 128, followed by a sigmoid activation. The architecture of student conditioning networks with improved capacity is shown in Fig. 28.
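
As a rough illustration of the layers described above, the following NumPy sketch implements a kernel-size-1 convolution as a per-time-step linear map and builds $h_{\mathrm{trans}}$ and $h_{\mathrm{gate}}$ from it. The random initialization of $h_{\mathrm{trans}}$, the zero bias in $h_{\mathrm{gate}}$, the 63-frame input, and all function names are assumptions made for illustration. How the two modules are wired into the conditioning network is given in Fig. 28 and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
C_MEL, C_OUT = 80, 128


def conv1x1(x, weight, bias):
    """Kernel-size-1 1-D convolution: the same linear map at every time step.
    x: (in_ch, T), weight: (out_ch, in_ch), bias: (out_ch,)."""
    return weight @ x + bias[:, None]


def leaky_relu(x, slope=0.4):
    return np.where(x >= 0, x, slope * x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# h_trans: conv(80 -> 80), leaky ReLU(0.4), conv(80 -> 80).
# Random initial weights are an assumption; the text does not specify them.
w1, b1 = rng.normal(0, 0.1, (C_MEL, C_MEL)), np.zeros(C_MEL)
w2, b2 = rng.normal(0, 0.1, (C_MEL, C_MEL)), np.zeros(C_MEL)


def h_trans(x):
    return conv1x1(leaky_relu(conv1x1(x, w1, b1)), w2, b2)


# h_gate: zero-initialized conv(80 -> 128), then sigmoid.
# Assumption: the bias is zero-initialized along with the weights.
wg, bg = np.zeros((C_OUT, C_MEL)), np.zeros(C_OUT)


def h_gate(x):
    return sigmoid(conv1x1(x, wg, bg))


mel = rng.normal(size=(C_MEL, 63))   # placeholder 80-band input with 63 frames
print(h_trans(mel).shape)            # (80, 63)
print(h_gate(mel).shape)             # (128, 63)
```

With zero-initialized weights and bias, the gate outputs 0.5 at every position regardless of its input, so at the start of training the gate does not yet depend on the spectrogram.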

Figure 28: High-level architecture of original and higher-capacity conditioning networks of DiffWave.

C-B Evaluation Metrics

The metrics for speech quality are as follows.

  1. Perceptual Evaluation of Speech Quality (PESQ2001, ), or PESQ, measures the quality of generated speech. It ranges between -0.5 and 4.5 and is higher for better quality.

  2. Short-Time Objective Intelligibility (taal2011algorithm, ), or STOI, measures the intelligibility of generated speech. It ranges between 0% and 100% and is higher for better intelligibility.

The voice classifier is trained and tested on audio clips 0.7256 seconds long. For each audio clip, we extract 20-dimensional Mel-frequency cepstral coefficients (xu2005hmm, ), 7-dimensional spectral contrast (jiang2002music, ), and 12-dimensional chroma features (ellis2007chroma, ). The classifier is a support vector classifier with a radial basis function kernel and regularization coefficient $=1$.
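
The classifier setup above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' code: the library choice, the synthetic Gaussian features standing in for real audio features, the two-class setup, and the default kernel width are all assumptions. Only the feature dimensions, the RBF kernel, and the regularization coefficient of 1 come from the text.

```python
import numpy as np
from sklearn.svm import SVC

# Feature dimensions stated in the text.
N_MFCC, N_CONTRAST, N_CHROMA = 20, 7, 12
FEATURE_DIM = N_MFCC + N_CONTRAST + N_CHROMA  # concatenated per-clip feature vector

rng = np.random.default_rng(0)

# Placeholder data (assumption): two synthetic "voices" drawn from shifted
# Gaussians. Real inputs would be features extracted from audio clips.
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(50, FEATURE_DIM)),
    rng.normal(+1.0, 1.0, size=(50, FEATURE_DIM)),
])
y = np.array([0] * 50 + [1] * 50)

# RBF-kernel support vector classifier with regularization coefficient C = 1.
# The kernel width (gamma) is left at the scikit-learn default, an assumption.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)
pred = clf.predict(X)

print(FEATURE_DIM, pred.shape)
```

Concatenating the three feature groups gives a 39-dimensional vector per clip. How frame-level features are aggregated into one vector per clip is not specified in the text, so the sketch starts from clip-level vectors.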