Robust Concept Erasure Using Task Vectors

Minh Pham, Kelly O. Marshall, Chinmay Hegde, and Niv Cohen
New York Univeristy
{mp5847,km3888,chinmay.h,nc3468}@nyu.edu

Abstract

With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow unsafe generations with other inputs. Here we focus on unconditionally erasing a concept from a text-to-image model rather than conditioning the erasure on the user’s prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called Diverse Inversion, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining the core functionality of the model.

Refer to caption — Figure 1: Concept erasure methods often filter out only a tiny volume in input space. Top row: Erased Stable Diffusion (with the “Van Gogh” concept erased); bottom row: SD 1.4. We plot generations using various adversarially optimized prompt embeddings, located at different Cosine similarities from the embedding of the prompt “Van Gogh”. Values in square brackets represent cosine similarities in embedding space with the prompt “Van Gogh” and are ordered from left (input is far away from the concept name) to right (closer to the concept name). ESD continues to produce “Van Gogh” concepts when the input prompt is far away from the original concept name.

1 Introduction

The capacity of text-to-image (T2I) generative models to produce high-quality images has improved significantly over time. Consequently, growing concerns surround their potential for generating undesirable content. Such concerns include: ability to “deepfake” images of real people; ability to synthesize copyrighted materials; and production of Not-Safe-For-Work (NSFW) content. A direct approach to mitigate these would be to perform data filtering, i.e., removing all images depicting undesired concepts from the model’s training set. However, automatically web-scraped, massive datasets are extremely hard to filter, and imperfect filtering often compromises the safety or the legal compliance of the resulting generative models. Additionally, even if filtering were feasible, retraining already existing models from scratch due to changes in regulations is often impractical due to high costs. For brevity, we refer here to all kinds of undesirable generations as ‘unsafe’.

Several recently proposed methods claim to “sanitize” unsafe concepts from T2I generative models [16, 9, 8, 3, 28, 11, 39]. Yet, when evaluated on unexpected inputs, these methods exhibit significant vulnerabilities, and bypassing these methods is relatively easy for adversarial methods [34, 24]. More specifically, most existing model sanitization approaches excel at averting the production of unsafe content, conditioned on a specific input prompt, or sequence of tokens. However, T2I models end up learning a many-to-many map** from prompts to image space, and a sufficiently motivated adversary — with fairly modest compute effort — can discover other input prompts that trigger a target unsafe concepts.

In this paper, we make further contributions towards the challenge of eliminating unsafe concepts from T2I models. Provably perfect erasure is beyond the scope of this paper, and we address a humbler challenge. Our goal is to develop an unconditional version of concept erasure — namely, a method that can effectively implement concept erasure in a way that is agnostic to the choice of specific user prompts.

Our core idea is based on a recently emergent technique known as Task Vectors [14]. At a high level, a task vector (TV) represents a displacement in the model’s weight space that is a result of fine-tuning; [14] shows that TVs can be flexibly used via arithmetic operations to enable editing of large models. Crucially, TV-based editing is independent of any specific user input, and therefore we showcase its ability to supply unconditional safety to T2I models. To apply TV-based concept erasure, we first finetune the model to generate a specific concept or style, and refer to the obtained weight difference as our TV. Next, we subtract the TV (possibly multiplied by a scalar $\alpha$ ) from the original model, thereby erasing the unsafe concept.

In Sec. 3 we define a criterion we term unconditional safety, measuring the performance of input-independent concept erasure on a model. We hypothesize that for some models, given a long enough prompt, an effective adversarial input might always be possible [35]. As such, we limit the prompt to a maximal fixed length when measuring a method’s ability to sanitize the generations. Although evaluating large models with our criterion is impractical, we can use it to show that TVs supply unconditional safety on toy models.

Following the strong performance of TV edits for unconditional concept erasure on toy models, we investigate whether they can be applied to large T2I models without compromising the model’s core functionality. Namely, we wish to apply TV edits while optimizing the trade-off between the erasure of unsafe concepts and the preservation of the model functionality. We characterize this trade-off by a parameter defining the edit strength, which is a scalar multiplying the vector magnitude. To tune the value of this parameter without relying on any given prompt, we propose a method called Diverse Inversion. Diverse Inversion finds a large set of token embeddings in dense space, all aimed at generating the same concept we wish to delete. We optimize in parallel many token embeddings, each of which should induce the generation of the target concept we wish to delete in the T2I model.

Our optimization process contains two constraints: (i) Limiting the minimal pair-wise similarity between every pair of embeddings, to ensure diversity in the obtained set of prompts. (ii) Limiting the similarity between each embedding and the embedding of the natural language description of the concept name embedding. Diverse Inversion allows us to tune the value of $\alpha$ in a manner that better generalizes to prompts compared to using a single input prompt.

Finally, we investigate what value of the strength parameter $\alpha$ should be used for adjusting weights in the task vector. We find that our Diverse Inversion technique allows us to find a good value of $\alpha$ . Additionally, it allows us to select a subset of model weights to edit, achieving a better tradeoff between concept erasure and control task performance.

Summary of our contributions. (i) Showing that the vulnerability of current concept erasure methods is caused by their dependence on specific input prompts (Sec. 3.2) (ii) Demonstrating TV-based editing as an efficient method for input-independent concept erasure (Sec. 3.3) (iii) Proposing Diverse Inversion, an algorithm to find a diverse set of dense prompts corresponding to a target concept, and utilizing it to allow a better trade-off between concept-erasure and model performance (Sec. 4).

2 Related Work

Denoising Diffusion Models. Diffusion models are a class of generative models that iteratively refine a distribution through a Markov-based denoising process [12, 33]. The process starts with a noise vector, $x_{T}$ , and progressively denoises it over $T$ steps to reconstruct the original data $x_{0}$ . In practice, the model is trained to predict the noise, $\epsilon_{t}$ , at each timestep, $t$ , which is used to progressively denoise the image, $x_{t}$ . Latent diffusion models (LDM) [27] enhance efficiency by working in a lower-dimensional space learned by an autoencoder. The first component of LDM includes a pre-trained encoder, $\mathcal{E}$ , and decoder, $\mathcal{D}$ , trained on a large dataset of images. The encoder maps an image, $x$ , to a spatial latent code, $z=\mathcal{E}(x)$ , and the decoder reconstructs the original image from the latent code, $\mathcal{D}(\mathcal{E}(x))\approx x$ . The second component is a diffusion model trained to generate codes in the learned latent space. Given an input, $c$ , the LDM is trained to generate an image conditioned on $c$ using the following objective function:

\mathcal{L}=\mathbb{E}_{z\sim\mathcal{E}(x),t,c,\epsilon\sim\mathcal{N}(0,1)}% \Big{[}\|\epsilon-\epsilon_{\theta}(z_{t},c,t)\|_{2}^{2}\Big{]}

where $z_{t}$ is the latent code for time $t$ , and $\epsilon_{\theta}$ is the denoising network. During inference, a random noise tensor is sampled in latent space and gradually denoised to produce a latent code, $z^{\prime}$ . The latent code is then transformed into an image using the pre-trained decoder, $x^{\prime}=\mathcal{D}(z^{\prime})$ .

Concept-Erasure on T2I Models. Recently, several strategies have been developed to prevent generative models from producing undesirable images. Negative Prompt (NP) [3] and Safe Latent Diffusion (SLD) [28] suggest modifying the inference process to divert the final output from undesired concepts. Other approaches employ classifiers to alter the output [26, 1, 4]. Since inference guiding methods can be evaded with sufficient access to model parameters [32], subsequent works including Erased Stable Diffusion (ESD) [8], Selective Amnesia (SA) [11], Forget-Me-Not (FMN) [39], Ablating Concepts (AC) [16], and Unified Concept Editing (UCE) [9] advocate for fine-tuning Stable Diffusion model weights.

Jailbraking Generative Models. Deep neural networks are known for their brittleness and various algorithms are known for creating inputs that lead these models to produce undesirable outputs. In the context of Large Language Models (LLMs), the term “jailbreaks” refers to adversarial inputs that trigger unsafe, harmful, or unwanted responses from the model. Some jailbreaks have been discovered manually through experimentation or red-teaming [5], while others have been discovered through LLM generation [23, 38]. Jailbreaking techniques include adding or prefixing adversarial strings to the original request. In the realm of text-to-image models, despite undergoing research on concept erasure methods used to remove undesirable concepts from the weights [8, 9, 16, 3, 28, 39, 11], recent works have shown that they are still susceptible to adversarial inputs [34, 24].

As current concept erasure methods for T2I models are often reliant on protecting against specific user inputs, adversarial methods find other inputs that can induce unsafe generations. In particular, Tsai et al. [34] uses a CLIP text encoder to construct a concept vector; a vector in embedding space representing the unsafe content. It then uses a genetic algorithm [31] to find hard prompts that produce the concept vector in the embedding space. Additionally, Pham et al. [24] propose Concept Inversion, which is a method based on Textual Inversion [7] to search for word embeddings that circumvent concept erasure methods. Textual Inversion [7] learns to capture the user-provided concept by representing it through new “words” in the embedding space of a frozen T2I model without changing the model weights. In particular, the authors designate a placeholder string, $c_{*}$ , to represent the new concept the user wishes to learn. They replace the vector associated with the tokenized string with a learned embedding $v_{*}$ , in essence “injecting” the concept into the model vocabulary. The technique is referred to as Textual Inversion and consists of finding an approximate solution to the following optimization problem:

v_{*}=\operatorname*{arg\,min}_{v}\mathbb{E}_{z\sim\mathcal{E}(x),c_{*},% \epsilon\sim\mathcal{N}(0,1),t}\Big{[}\|\epsilon-\epsilon_{\theta}(z_{t},c_{*}% ,t)\|_{2}^{2}\Big{]}.

Task Vectors and Parameter Space Interpolations. Although neural networks are inherently non-linear, previous research has shown that interpolating the weights of two neural networks can preserve their high accuracy if they share a portion of their optimization trajectory [15, 6]. In the context of fine-tuning, accuracy consistently improves when the weights of a pre-trained model are gradually shifted towards its fine-tuned counterpart [22, 13, 37]. Beyond a single task, Matena & Raffel [22] discovered that averaging the weights of multiple models, fine-tuned on different tasks from the same starting point can result in a model with high accuracy on all the fine-tuning tasks. Li et al. [19] observed similar outcomes when averaging the parameters of language models fine-tuned across various domains. Wortsman et al. [36] found that averaging the weights of models fine-tuned on multiple tasks can improve accuracy on a new downstream task without additional training.

Interestingly, the weight difference learned during fine tuning can also be learned on one task and transferred to another to achieve a similar function. Like a vector, it can also be multiplied by a (possibly negative) scalar, and often conveys an appropriate meaning to the model function. Ilharco et al. [14] first compute a Task Vector (TV) as:

\tau=\theta_{ft}-\theta_{pre},

where $\theta_{pre}$ is the pre-trained model and $\theta_{ft}$ is the model fine-tuned on a selected set of tasks. Subtracting the TV, scaled by a constant $\alpha$ , from the pre-trained weights $\theta_{pre}$ will make the model perform worse on the selected tasks for which the fine-tuning process was done. On the other hand, adding a scaled TV will improve the model’s performance on the same tasks. Ilharco et al. [14] show that Task Vectors, scaled by $\alpha\in[0,1]$ , can be applied to CLIP classifiers and LLMs to alter their behavior. In this work, we show that Task Vectors can also be applied to text-to-image diffusion models (in particular, the UNet module in Stable Diffusion) to perform concept erasure.

3 Conditional and Unconditional Concept Erasure

3.1 Motivating analysis

We start by noticing that current concept erasure methods are input-dependent. Such methods rely on the concept name to suppress the generation of a targeted concept. For instance, ESD [8] fine-tunes the pre-trained diffusion U-Net model weights to remove a specific style or concept when conditioned on a specific prompt. The authors propose a fine-tuning loss that reduces the probability of generating an image $x$ based on the likelihood described by the textual description of the concept, aiming to reduce the probability of the target concept $c$ : $\mathbb{P}_{\theta^{*}}(x)\propto\frac{\mathbb{P}_{\theta}(x)}{\mathbb{P}_{% \theta}(c|x)^{\eta}}$ , where $\theta^{*}$ is the UNet weights of the diffusion model, $\theta$ is the original weights, $\eta$ is a scale power factor, and $\mathbb{P}(x)$ represents the distribution generated by the original model. Since the loss function depends on the concept name $c$ , we hypothesize that ESD only suppresses the generation of the targeted concept when explicitly prompted with its textual name.

To further investigate this, we inspect the input space of the SD 1.4 model, and look for different embeddings in dense space that would generate a target concept (e.g., “Van Gogh”). To this end, we use Textual Inversion [7] and limit it to different similarity ranges from the concept name (i.e., the embedding of the string “Van Gogh”). As can be seen in Fig. 1, for the unedited model (second row, “SD 1.4”) a large set of embeddings in dense space ranging in different similarities from the concept name, can all generate images featuring the target concept. Too far away from the concept name generations may gradually fail to reconstruct the concept.

While it is already known that common concept erasure methods may be circumvented by prompts not seen during the erasure process [34, 24], we find that they often filter only a small neighborhood around the embedding used for training. For example, with the ESD concept erasure method [8], we see similar inversion capabilities across most of the model input space. The input filtration claimed by [24] is performed only in proximity to a specific input prompt, as we illustrate in Fig. 2. Therefore, while existing methods are effective in blocking expected prompts, they are less robust to unexpected ones. Motivated by this analysis, we turn to the main goal of this paper: First, we define a notion of safety that goes beyond a specific user prompt. Second, we suggest an effective method for concept erasure that does not depend on a specific prompt.

3.2 Marginal, conditional, and absolute safety

A straightforward way to quantify the safety of a model against undesired generations is to estimate the marginal probability of an unsafe generation. Given a generative model $G$ , a prompt $p$ drawn from a distribution $D$ , and the set of unsafe generations $U$ , the marginal probability $P$ for unsafe generation is given as $S_{marginal}$ :

S_{marginal}=P_{p\sim D}(G(p)\in U).

(1)

We note that the marginal probability $P$ is also potentially affected by the noise distribution (e.g., the random noise which is the input for a diffusion model). Yet, explicitly calculating this marginal safety is not possible as the empirical distribution of test prompts may be unknown. A main reason for it is that adversaries may change the prompt distribution after the design of our model $G$ and its safety mechanisms.

As this notion of marginal safety is impractical, it is tempting to replace it with a criterion of conditional safety. This notion takes into account a set of a few prompts supplied by the user, $C$ , known to be related to the target concept we wish to erase. Our aim then would be to reduce the maximal probability of unsafe generation with any of the prompts $p\in C$ :

S_{conditional}=\max_{{p}\in C}P(G(p)\in U).

(2)

Optimizing $G$ for this safety criteria would ensure that all prompts within the set $C$ would induce an unsafe behavior with a probability $S_{conditional}$ at most. This safety criterion is often optimized by most existing concept erasure methods. Yet, optimizing this criterion would not yield any guarantee outside the set $C$ .

Therefore, we focus on a safety criterion that is independent from any user-supplied prompts we term unconditional safety. We suggest a safety criterion limiting the probability with which unsafe generation would occur, given a constrained input complexity (e.g., the prompt length). While input length is a good parameter for input complexity in many cases, with dense embedding we may instead limit the resolution in which the dense embedding is given. The resolution of a continuous input vector $v\in\mathbb{R}^{d}$ is closely related to the input prompt length [35]. For one example, encoding a higher resolution dense embedding corresponds to more bits of information [18]. Specifically, we denote the input complexity (measured by resolution or length) by $D_{L}$ and use it to write the unconditional safety criteria. Namely, the unconditional safety criteria $L_{uncod}$ is the minimal complexity $L$ (e.g., prompt length) for which we have a prompt $p\in D_{L}$ that induces an unsafe generation $G(p)\in U$ with a probability of at least $\varepsilon$ .

L_{uncod}=\text{ the minimal }L\ \ \text{s.t.}:\\ \max_{p\in D_{L}}P(G(p)\in U)>\varepsilon\\

(3)

We discuss other possible safety criteria in Sec. 6.

3.3 Task Vectors for unconditional safety

Although calculating the unconditional safety criterion $L_{uncod}$ is impractical for large values of $L$ , we can demonstrate its improvement on toy models. We hypothesize that prompt-independent concept erasure methods such as TV edits may provide better unconditional safety. To test this hypothesis we trained a toy model with a dense “prompt” space of dimensions $d=8$ . We trained our model to generate images from the MNIST [17] dataset (See SM for implementation details).

Figure 3: TV-based concept-erasure provides better unconditional safety. We plot the probability of unsafe generation with the most successful adversarial prompt from each given input complexity class (See Sec.3.2). While the input-dependent (finetune-based) concept erasure method is focused on protecting against undesired generations with a specific prompt, other prompts still produce unsafe generations with high probability. The input-independent (TV-based) erasure reduces the probability of unsafe generations compared both to the original and the input-independent models, across the different complexity classes.

We apply three different concept-erasure models to erase the MNIST digit 0 (“the target concept”) from our diffusion model: (i) Input dependent concept-erasure: We fine-tune the model to produce the remaining 9 digits when given 0 as conditional input. (ii) Input-independent concept-erasure: We utilize a TV edit for input-independent concept-erasure [14]. We fine-tune our model to generate only the target concept (digit 0), and then subtract the model weight change achieved by the fine-tuning process from the original model. This process is input-independent as we perform unconditional fine-tuning discarding the usage of conditional input embedding (iii) Original model: We also evaluate the original model, without concept erasure. For all models, we use a pre-trained classifier to automatically evaluate whether the target concept was indeed generated. Implementation details for the classifier and all three methods, along with examples of the faithfulness of our classifier to human semantics can be found in the SM.

We now turn to evaluate the unconditional safety criterion for the three models above. Specifically, we define the input complexity classes $D_{L}$ as the resolutions in which we perform an exhaustive search of the possible prompts in continuous space. For each complexity class $D_{L}$ , we explore a grid in $d$ dimensions, with the inspected values in each dimension comprised of $L$ equally spaced values, totaling $L^{d}$ points per grid. For example, the grid point $(0.0,1.0,1.0,0.0,-1.0,...)$ belongs to a low input complexity class, while the grid point $(0.4,0.8,0.6,-0.2,-1.0,...)$ belongs to a higher one. Intuitively, the higher the input complexity class we examine, the closer we are to an exhaustive search in continuous input space.

We can clearly see in Fig. 3 that the TV edit provides a much better unconditional safety $L_{uncod}$ guarantee. The original and fine-tuned models provide a non-trivial probability of generating the target concept, even for relatively low complexity parameters $L$ . Moreover, even for the fine-tuned concept erasure, we can find prompts in the medium complexity range ( $L=5$ ) that generate the target concept with very high probability. However, with the TV-based concept-erasure, the unsafe generation can be better mitigated across all examined complexity classes. This suggests that the TV-based erasure does not merely input-filter the model, but attenuates its ability to generate the unsafe concept more robustly across the input space.

Interpreting this result according to our unconditional safety criteria (Eq. 3), we find that only the input independent method provides a non-trivial bound of the unsafe generation probability for complexity parameters $L\geq 5$ . As our input space in this experiment is of dimension $d=8$ , complexity classes of large values of $L$ are infeasible to compute ( $D_{L=6}$ already contains $L^{d}=1679616$ possible dense prompts).

4 Diverse Inversion for Robust Concept Erasure Using Task Vectors

Motivated by the potential of TV-based editing as a method capable of improving the unconditional safety of T2I models, we now focus on applying this technique to larger models. Namely, we wish to erase unsafe concepts from large diffusion models while otherwise retaining their text-to-image capabilities. Measuring the degree of preservation of the desired text-to-image capabilities can be done directly, since this typically involves expected user inputs and outputs. However, anticipating the model’s reaction to adversarial prompts unknown at the time of editing can be challenging.

To estimate how well the model is protected against unexpected inputs, we would like to observe its outputs for a diverse array of adversarial prompts. We cannot inspect all the input prompts of a given length as we did for the toy model, due to the very large number of possible prompts. To this end, we create a diverse safety validation set composed of diverse input tokens that can all generate unsafe content with the original model. We note that a real-life adversary chooses their prompt after TV-based concept erasure has been applied, and not before it. Yet, the fact that an adversarial prompt often transfers well between erased and original (un-erased) models [24, 40] motivates us to rely on a large set of diverse adversarial prompts optimized for the original method. We show in Sec.5.2 that our method allows us to apply TV-based erasure to gain robustness to adversarial techniques applied after the erasure edit.

Our method for applying Task Vectors for concept erasure in large models consists of three parts. First, we learn a diverse set of adversarial prompts, allowing us to estimate TV edit robustness. Next, we show this learned set allows us to choose robust hyper-parameters for TV edits while maintaining the model utility. Finally, we show that we can not only choose performant hyper-parameter values but also sub-select the set of model parameters we wish to edit for better performance.

4.1 Diverse Inversion

As we discuss in Sec. 3, concept erasure methods can provide a false sense of security by performing “input-filtering”. This suggests that additional inputs are needed to better evaluate concept erasure methods. We would like to have a diverse set on inputs, evaluating the concept erasure capability independently from any specific adversarial prompt. Yet, our experiment in Fig. 1 also shows that in addition to suppressing the harmful generation when prompted with the concept name, the sanitized Stable Diffusion model also sanitizes surrounding word embeddings. Therefore, these additional inputs need to be far from the embedding of the concept name as well as sufficiently diverse. To create a better list of inputs for robust evaluation of concept erasure, one can search for word embeddings as follows:

		$\displaystyle\mathbf{v}_{}=\operatorname{arg\,min}_{\mathbf{v}}\mathbb{E}_{z% \sim\mathcal{E}(x),c_{},\epsilon\sim\mathcal{N}(0,1),t}\Big{[}\\|\epsilon-% \epsilon_{\theta}(z_{t},c_{},t)\\|_{2}^{2}\Big{]},$		(4)
		$\displaystyle\text{s.t. }\left\{\begin{aligned} &\text{Sim}(v_{i},v_{\text{% concept}})\in[a,b]\text{for}i=1,2,\ldots,n,\\ &\text{Sim}(v_{i},v_{j*})\in[c,d]\text{for}i,j=1,2,\ldots,n,i\neq j.\end{% aligned}\right.$		(4)

In Eq. 4, we optimize for a set of embeddings $\mathbf{v}_{*}=(v_{1},v_{2},...,v_{n})$ . The first constraint ensures that the learned embeddings are not too close to the embedding of the concept name (e.g. Van Gogh). The second constraint pushes the learned embeddings away from each other to diversify them. Nevertheless, the optimization procedure in Eq. 4 can be highly non-convex, and we found that vanilla inversion with random restart can be used as an approximation to learn sufficiently diverse embeddings for our proposed erasure method.

4.2 Tuning the TV edit strength

With the augmented set of inputs that all make Stable Diffusion generate images of the target concept, we can choose the parameter $\alpha$ that controls the edit strength of the TV. We look for a value that suppresses any such generation with a prompt from our Diverse Inversion set. In a recent result, Pham et al. [24] show that a robust model might not always be usable in practice. I.e. the model outputs for non-adversarial prompts will not align with the prompts’ semantics. Hence, we also measure the model performance on control tasks featuring unrelated concepts. Examining both measures we can vary the value of $\alpha$ to create a scatter plot and pick the $\alpha$ that yields the desired trade-off between robustness and usability (Sec. 5.2).

4.3 Sub-selecting TV weights

In our experiments, larger values of $\alpha$ tend to make the Stable Diffusion model more robust against inversion. However, this can also affect the model performance on unrelated tasks. Motivated by [10, 21], we hypothesize that not all layers on the UNet need to be edited. We test our hypothesis by not editing certain blocks of the UNet. In other words, we suggest pruning certain layers of the TV weights.

(a) “French Horn”

(b) “Art by Thomas Kinkade”

Figure 4: The trade-off between erasure score and control task performance. We plot the robustness measured according to erasure score, lower is better, and control task performance, higher is better, for models erased with different TV edit strengths (parameterized by

\alpha

). Our Diverse Inversion method allows us to explore the trade-off between concept erasure robustness and model utility when editing different subsets of the model parameter. We discover that different target concepts may benefit from editing different subsets of model parameters.

5 Experiments

5.1 Experimental setup

Metrics: To assess the content of the generated images, we use CLIP ViT-B/32 [25] pre-trained on LAION-2B [29]. Following previous works [24, 11, 8, 9], the control task for all experiments is the average CLIP similarity score of 6 concepts across 3 different concept categories (artistic style, objects, and specific people): “art by Kilian Eng”, “art by Picasso”, “garbage truck”, “chain saw”, “Brad Pitt”, and “Angelina Jolie”. Motivated by our experiment in Sec. 3, we propose to use a metric known as Erasure Score to validate the robustness of the edited Stable Diffusion model to many different attack prompts. The metric is defined as follows: after obtaining word embeddings via Diverse Inversion, we generate an image for each learned embedding and the concept name from the Stable Diffusion model. Erasure Score (ES) is defined as the maximum (calculated over all generated images) CLIP similarity between the generated images and the concept name. A lower Erasure Score indicates more robustness against adversarial inputs. Our results on robustness to different adversarial methods are demonstrated qualitatively in Figs. 6 and 5, and quantitatively in the supplementary material (SM).

Implementation Details: For calculating the Erasure Score we use our Diverse Inversion method to find $100$ word embeddings that trigger the generation of the erased concept. These embeddings lie evenly across 5 non-overlap** intervals, where each interval represents the allowed similarity between the learned embeddings and the embedding of the concept name. Moreover, we ensure that the pairwise similarity between learned embeddings within each interval cannot be too large. To obtain the TV for concept erasure, we first fine-tune the UNet of the Stable Diffusion model on a combination of synthetic and real images of the targeted concept, using the empty string as the caption. The fine-tuned UNet is then used to compute TV for the editing procedure. Further details appear in the SM.

5.2 Results

We demonstrate in Figs. 5 and 6 that our method provides robustness to current adversarial methods applied after the concept erasure edit. In the second row in each of the sub-figures of Fig. 5 we show that for certain values of the edit strength $\alpha$ , the Stable Diffusion model manages to suppress the generation of the targeted concept when explicitly prompted with the same concept name. However, when Concept Inversion [24] is applied, we can still recover the erased concept. On the other hand, when $\alpha$ is increased, we obtain both a lower Erasure Score (ES) and a more robust erased model. This suggests that we can use the Erasure Score to guide us in selecting an appropriate edit strength, $\alpha$ , to make the model more robust against adversarial inputs. We also test our edited models against hard prompts obtained from the Ring-A-Bell method [34]. Fig. 6 shows that the adversarial prompts manage to fully circumvent 7 concept erasure methods but are unable to recover the target concepts erased using TV.

A notable drawback of using Task Vectors (TV) is that this method requires using significantly higher values of $\alpha$ to enhance the Stable Diffusion model’s robustness against adversarial inputs. Consequently, this might compromise the model’s generative performance on concepts unrelated to the erased concept. Fig. 4 demonstrates that certain layers of TV can be pruned to better preserve generative performance on unrelated concepts, while still maintaining robustness against adversarial inputs.

6 Discussion

Model innocence as a safety goal. It is tempting to pursue other definitions of safety as well, such as model innocence which stipulates that a model behaves similarly to one that was never exposed to any examples of unsafe behaviors. Yet, beyond the questions of practicality, an innocent model may still be unsafe. For example, an advanced enough model that understands the concept of something being safe-for-work, and the concept of negation, may be expected to allow the creation of various not-safe-for-work images [2].

Absolute safety. Making it impossible to generate unwanted content may seem to represent the absolute ideal in model safety. However, completely avoiding unsafe content depends on the ability to recognize all unsafe behaviors. The ability to describe all unsafe behaviors without defining such behaviors in advance is therefore left for further research [2].

7 Limitations

Provable guarantees for erasure. An inherent weakness of any erasure method is the inability to evaluate them in advance against yet unknown future adversarial methods [2]. We acknowledge this as a weakness of our suggested method as well. Yet, we provide not only results against current adversarial methods but also a principled analysis of the unique qualities of TV-based concept erasure being input-independent; this is one of our main contributions.

TV-based erasure. Our suggested method is reliant on TV techniques. Yet, the parameter space of neural networks is far from being completely understood [20]. This means that the exact cases where TV-based erasure can work or fail are not clear yet. The application of Task Vectors for more fine-grained, or coarse-grained concepts, is yet to be explored. Similarly, it is not clear yet how to apply our Diverse Inversion techniques to other modalities such as language. Additionally, when erasing an excessive number of concepts it is not clear how to avoid a significant deterioration of the control task performance.

Dependence on the Diverse Inversion set. While we claim to suggest an input-independent concept erasure method, our method is dependent on the discovered Diverse Inversion set. Nevertheless, we only use it to tune hyper-parameters: the TV edit strength, and the identity of the edited layers. Therefore, our edit is not only independent of any user-supplied prompt but also, apart from hyper-parameters, independent of the embedding found using the Diverse Inversion method.

8 Conclusions

We propose adapting Task Vectors (TV), a recently proposed technique for model editing, for erasing concepts from generative models. On a range of test cases, we demonstrate how TVs can be used to sanitize undesirable concepts from text-to-image models in a way that is independent of specific user prompts. This distinguishes TV from existing methods in the literature and makes it more robust. Our method, Diverse Inversion, enables us to better maintain model utility while removing harmful concepts. We anticipate that our method will be of interest to the broader AI safety community, and can be extended to other model families such as large language models (LLMs) and other multimodal vision-language models.

Acknowledgments

The authors were partially supported by the AI Research Institutes Program supported by NSF and USDA-NIFA under grant no. 2021-67021-35329, NSF SaTC grant 2154119, and a Cyber NYC gift from Google Research. KM was supported by a US Department of Education GAANN fellowship. NC was partially supported by the Israeli data science scholarship for outstanding postdoctoral fellows (VATAT).

References

AI [2022] Stability AI. Stable diffusion 2.0 release, 2022. Jul 9, 2023.
Amodei et al. [2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
AUTOMATIC1111 [2022] AUTOMATIC1111. Negative prompt, 2022.
Bedapudi [2022] Praneeth Bedapudi. Nudenet: Neural nets for nudity detection and censoring, 2022.
Chao et al. [2023] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. CoRR, abs/2310.08419, 2023.
Frankle et al. [2020] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, 2020.
Gal et al. [2023] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In International Conference on Learning Representations, 2023.
Gandikota et al. [2023a] Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. In International Conference on Computer Vision, 2023a.
Gandikota et al. [2023b] Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzynska, and David Bau. Unified concept editing in diffusion models. In IEEE/CVF Winter Conference on Applications of Computer Vision, 2023b.
Hase et al. [2023] Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. In Annual Conference on Neural Information Processing Systems, 2023.
Heng and Soh [2023] Alvin Heng and Harold Soh. Selective amnesia: A continual learning approach to forgetting in deep generative models. In Advances in Neural Information Processing Systems, 2023.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
Ilharco et al. [2022] Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, and Ludwig Schmidt. Patching open-vocabulary models by interpolating weights. In Advances in Neural Information Processing Systems, 2022.
Ilharco et al. [2023] Gabriel Ilharco, Marco Túlio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations, 2023.
Izmailov et al. [2018] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry P. Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence, 2018.
Kumari et al. [2023] Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, and Jun-Yan Zhu. Ablating concepts in text-to-image diffusion models. In International Conference on Computer Vision, 2023.
LeCun et al. [2010] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Li et al. [2008] Ming Li, Paul Vitányi, et al. An introduction to Kolmogorov complexity and its applications. Springer, 2008.
Li et al. [2022] Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models. CoRR, abs/2208.03306, 2022.
Ma et al. [2020] Chao Ma, Stephan Wojtowytsch, Lei Wu, et al. Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t. arXiv preprint arXiv:2009.10713, 2020.
Maini et al. [2023] Pratyush Maini, Michael Curtis Mozer, Hanie Sedghi, Zachary Chase Lipton, J. Zico Kolter, and Chiyuan Zhang. Can neural network memorization be localized? In International Conference on Machine Learning, 2023.
Matena and Raffel [2022] Michael Matena and Colin Raffel. Merging models with fisher-weighted averaging. In Advances in Neural Information Processing Systems, 2022.
Perez et al. [2022] Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. In Conference on Empirical Methods in Natural Language Processing, 2022.
Pham et al. [2024] Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal, and Chinmay Hegde. Circumventing concept erasure methods for text-to-image generative models. In International Conference on Learning Representations, 2024.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
Rando et al. [2022] Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr. Red-teaming the stable diffusion safety filter. In Advances in Neural Information Processing Systems Workshop, 2022.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition, 2022.
Schramowski et al. [2023] Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Conference on Computer Vision and Pattern Recognition, 2023.
Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: an open large-scale dataset for training next generation image-text models. In Advances in Neural Information Processing Systems, 2022.
Sehwag [2021] Vikash Sehwag. Minimal implementation of diffusion models, 2021.
Sivanandam and Deepa [2008] S. N. Sivanandam and S. N. Deepa. Introduction to genetic algorithms. Springer, 2008.
SmithMano [2022] SmithMano. Tutorial: How to remove the safety filter in 5 seconds, 2022.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning Workshop, 2015.
Tsai et al. [2024] Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie, Chih-Hsun Lin, Jia-You Chen, Bo Li, Pin-Yu Chen, Chia-Mu Yu, and Chun-Ying Huang. Ring-a-bell! how reliable are concept removal methods for diffusion models? In International Conference on Learning Representations, 2024.
Wolf et al. [2023] Yotam Wolf, Noam Wies, Yoav Levine, and Amnon Shashua. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023.
Wortsman et al. [2022a] Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, 2022a.
Wortsman et al. [2022b] Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. In Conference on Computer Vision and Pattern Recognition, 2022b.
Wu et al. [2024] Yuanwei Wu, Xiang Li, Yixin Liu, Pan Zhou, and Lichao Sun. Jailbreaking gpt-4v via self-adversarial attacks with system prompts. CoRR, abs/2311.09127, 2024.
Zhang et al. [2023] Eric J. Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. CoRR, abs/2303.17591, 2023.
Zou et al. [2023] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. CoRR, abs/2307.15043, 2023.

\thetitle

Supplementary Material

CAUTION: This section includes generated content that may contain offensive or distressing material.

9 Additional Implementation Details

For all of our experiments except the toy MNIST one, we use Stable Diffusion 1.4 (SD 1.4). To compute the TV for SD 1.4, we fine-tune the UNet component on 15 synthetic images and 15 real images obtained from Google Images (30 in total). The synthetic images are obtained from the unedited SD 1.4 using the prompt “a photo of [object name]” for object concepts, and “a painting in the style of [artist name]” for art style concepts. We fine-tune for 1000 steps using a learning rate of $1e-05$ .

10 Images for MNIST Experiment

Fig. 7 demonstrates that both Fine-tuning and TV can be used to erase the digit 0 from the toy diffusion model. We also use a pre-trained classifier to quantitatively assess the generation quality of the edited models in Tab. 1. Both editing methods can erase the target class when the model is given $0$ as conditional input, while preserving generative performance on other classes. However, Fig. 8 shows that TV is more robust against inversion. For the toy diffusion models, we used implementation from [30]. However, we modify the architecture to support conditional embeddings of dimension $d=8$ , and normalize the input embeddings during training and inference. Such modifications are made to make the space of input embeddings that generate actual digits more compact. This makes the model more likely to generate faithful images when given our sampled embeddings as conditional input. We train and fine-tune using a batch size of $512$ for $100$ epochs.

Table 1: Classification accuracy (%) on generated images

	Target class	Other classes
Original	98.2	98.1
Fine-tuning	1.4	97.3
Task Vector	0.4	97.3

10.1 Ablation for Diverse Inversion

To study the necessity of Diverse Inversion, we also perform vanilla Textual Inversion (TI) [7] to find $50$ word embeddings for Van Gogh style. Fig. 9 suggests that without the additional constraints, the cosine similarities between the learned embeddings through vanilla TI and the embedding of the concept name will center around $0.0$ . However, with Diverse Inversion, we can enhance the diversity of our learned embeddings by controlling such cosine similarities taken with respect to the concept name. Fig. 10 shows samples of SD 1.4 when we used the learned embeddings of Diverse Inversion as conditional input.

10.2 Additional Results

We provide additional results on the robustness of TV in Tab. 2. We followed a similar setup to Pham et al. [24] to quantify the robustness of erasure methods on objects. In particular, we first generate 500 images for the prompt with the concept name and 500 images with Concept Inversion. We then use a pre-trained classifier to measure the presence of the erased object in the generated images.

Table 2: Quantitative results of Concept Inversion for object concept (Acc. % of erased model / Acc. % of CI): Using Concept Inversion, we can generate images of the erased objects, which can be seen by an increase in average accuracy across 4 concept erasure methods except Task Vector. “SD 1.4” features the accuracy of the original model with the concept name. Accuracy is measured with a classifier as in [8].

{tblr}

width = colspec = Q[150]Q[100]Q[100]Q[100]Q[100]Q[100]Q[100], columneven = c, column2-6 = c, hline1 = 2-7, hline2 = -, hline12 = -, colsep=2pt, rowsep=0.1pt, & SD 1.4 TV ESD UCE NP SLD-Med
cassette player 6.4 2.0 / 0.0 0.2 / 6.2 0.0 / 2.8 4.0 / 9.4 1.0 / 2.4
chain saw 68.6 1.2 / 0.3 0.0 / 64.0 0.0 / 43.6 4.0 / 82.8 0.8 / 86.6
church 79.6 12.4 / 0.4 0.8 / 87.4 10.0 / 82.2 25.4 / 78.4 20.6 / 72.0
english springer 93.6 9.1 / 0.3 0.2 / 48.2 0.0 / 69.6 27.0 / 90.4 24.6 / 96.4
french horn 99.3 26.1 / 0.0 0.0 / 81.6 0.4 / 99.4 62.4 / 99.0 17.0 / 97.6
garbage truck 83.2 9.2 / 0.4 0.8 / 57.0 16.4 / 89.6 39.4 / 84.6 19.8 / 94.8
gas pump 76.6 3.2 / 0.3 0.0 / 73.8 0.0 / 73.0 18.0 / 79.6 12.8 / 75.6
golf ball 96.2 13.4 / 0.5 0.0 / 28.6 0.2 / 18.6 45.2 / 88.4 60.2 / 98.8
parachute 96.2 19.2 / 0.0 0.0 / 94.2 1.6 / 94.2 32.8 / 77.2 52.8 / 95.8
tench 79.6 12.2 / 0.1 0.3 / 59.7 0.0 / 20.6 27.6 / 72.6 20.6 / 75.4
Average 77.9 10.8 / 0.2 0.2 / 60.1 2.9 / 59.4 28.6 / 76.2 23.0 / 79.5