Deceptive Diffusion: Generating Synthetic Adversarial Examples

Lucas Beerens, School of Mathematics, University of Edinburgh, EH9 3FD, UK. Supported by MAC-MIGS Centre for Doctoral Training under EPSRC grant EP/S023291/1.
Catherine F. Higham, School of Computing Science, University of Glasgow, Sir Alwyn Williams Building, Glasgow, G12 8QQ. Supported by EPSRC grant EP/T00097X/1.
Desmond J. Higham, School of Mathematics, University of Edinburgh, EH9 3FD, UK. Supported by EPSRC grant EP/V046527/1.
Abstract

We introduce the concept of deceptive diffusion: training a generative AI model to produce adversarial images. Whereas a traditional adversarial attack algorithm aims to perturb an existing image to induce a misclassification, the deceptive diffusion model can create an arbitrary number of new, misclassified images that are not directly associated with training or test images. Deceptive diffusion offers the possibility of strengthening defence algorithms by providing adversarial training data at scale, including types of misclassification that are otherwise difficult to find. In our experiments, we also investigate the effect of training on a partially attacked data set. This highlights a new type of vulnerability for generative diffusion models: if an attacker is able to stealthily poison a portion of the training data, then the resulting diffusion model will generate a similar proportion of misleading outputs.

1 Motivation

In this work, we combine two types of algorithm that have come to prominence in artificial intelligence (AI): adversarial and generative. Adversarial attack algorithms are designed to reveal vulnerabilities in classification systems; for example by perturbing a chosen image in a way that is imperceptible to the human eye, but causes a change in classification [11, 29]. Generative models are designed to create outputs that are similar to, but not simply copies of, the examples on which they were trained [9, 14]. Here, we show that by training on data that consists of adversarially perturbed images, a generative diffusion model can be made to create fresh examples of adversarial images that do not correspond directly to any underlying real images.

In Section 2 we give some background information on the two main ingredients of our work: adversarial attack algorithms and generative diffusion models. Section 3 describes the results of computational experiments where we investigate the idea of training a diffusion model on adversarially perturbed data. We finish with a brief discussion in Section 4.

1.1 Related Work

We refer to [5] for an overview of recent attempts to use generative AI tools to produce adversarial inputs. So far, the AdvDiffuser algorithm of [5] appears to be the first and only approach to generating new, synthesized, examples of adversarial images using a diffusion model. In that work, the authors take an existing, trained diffusion model and adapt the denoising, or backward, process by adding adversarial perturbations at each time step. This change increases computational complexity, since an extra gradient step is required at each time point. Our approach differs by building a new diffusion model, which then generates images with a standard de-noising algorithm. In addition to lowering the computational cost, our deceptive diffusion method reveals a new type of security threat that arises when standard generative diffusion models are created on training data that has been attacked. In particular, we find that the drop in classification success is in direct proportion to the fraction of training data that is adversarially perturbed. Hence, if an attacker is able to poison some portion of the training data, the builders of a generative diffusion model may inadvertently create a tool that produces a corresponding proportion of adversarial images.

2 Background

2.1 Adversarial Attack Algorithms

State-of-the-art image classification tools are known to possess inherent vulnerabilities. In particular, they can be fooled by adversarial attacks, where an existing image undergoes a small perturbation that would not be noticeable to a human, but causes a change in the predicted class. Since this effect was first pointed out [11, 29], a wide range of attack and defence strategies have been put forward [2, 3, 21, 22, 23], and bigger picture questions concerning the inevitability of attack success have been investigated [7, 10, 27, 30, 31]. The susceptibility of AI systems to attack is a serious issue in many application areas and it is pertinent to the recent calls for AI regulation. For example, the amendment of June 2023 [24] to Article 15 – paragraph 4 – subparagraph 1 of the EU AI Act [8] requires that: “High-risk AI systems shall be resilient as regards to attempts by unauthorised third parties to alter their use, behaviour, outputs or performance by exploiting the system vulnerabilities.”

2.2 Generative Diffusion Models

A generative diffusion model for creating realistic, but synthetic, images can be built by first training a neural network to de-noise a collection of noisy images, and then asking the network to de-noise a new sample of pure noise [4].

In Algorithms 1 and 2 we summarize the basic unconditional diffusion model setting from [14]; see also [13, 20] for detailed explanations of the steps involved. Here, the $\alpha_t$ are parameters taking values between zero and one. They have the form $\alpha_t = 1 - \beta_t$, where the predetermined sequence $\beta_1, \beta_2, \ldots, \beta_T$ is known as the variance schedule. In [14], linearly increasing values from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ are used. We also let

$$\overline{\alpha}_t = \prod_{i=1}^{t} \alpha_i,$$

and

$$\sigma_q^2(t) = \frac{(1-\alpha_t)(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}.$$

In step 5 of Algorithm 1, $\boldsymbol{\epsilon}_\theta$ denotes the output from a neural network. Given a version of the noisy image, $\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}$, corresponding to a time $t$, the job of the network is to predict the noise $\boldsymbol{\epsilon}$. Here, a simple least-squares loss function is used.
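To make these definitions concrete, the following minimal sketch computes the variance schedule and the derived quantities with NumPy. The function name `linear_variance_schedule` and the choice $T = 1000$ are assumptions made for illustration; they are not taken from the experiments reported here. Array index `k` corresponds to time $t = k + 1$, and we use the convention $\overline{\alpha}_0 = 1$ (the empty product).

```python
import numpy as np

def linear_variance_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Return beta, alpha, alpha_bar and sigma_q^2 as arrays of length T.

    Index k of each array corresponds to time t = k + 1.
    T = 1000 is an assumed value, used only for illustration.
    """
    beta = np.linspace(beta_1, beta_T, T)        # linearly increasing variance schedule
    alpha = 1.0 - beta                           # alpha_t = 1 - beta_t
    alpha_bar = np.cumprod(alpha)                # alpha_bar_t = alpha_1 * ... * alpha_t
    # alpha_bar_{t-1}, using the empty product alpha_bar_0 = 1
    alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))
    sigma_q_sq = (1.0 - alpha) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar)
    return beta, alpha, alpha_bar, sigma_q_sq

beta, alpha, alpha_bar, sigma_q_sq = linear_variance_schedule()
```

With the convention $\overline{\alpha}_0 = 1$ we have $\sigma_q^2(1) = 0$, so no fresh noise enters at the final de-noising step.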

Algorithm 1 Training with the forward process [14]
1: repeat
2:     $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ ▷ choose an image from training set
3:     $t \sim \mathrm{Uniform}(\{1, 2, \ldots, T\})$
4:     $\boldsymbol{\epsilon} \sim \mathrm{N}(\mathbf{0}, \mathbf{I})$ ▷ standard Gaussian sample
5:     Take gradient step w.r.t. $\theta$ on $\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}, t)\|_2^2$
6: until converged
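The quantities appearing in steps 3 to 5 of Algorithm 1 can be sketched as follows. This is an illustration only: `eps_theta` is a hypothetical stand-in that always predicts zero noise, the image is a random placeholder rather than an MNIST digit, $T = 1000$ is an assumed value, and the gradient step itself is omitted because it depends on the network and optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                   # assumed number of time steps
beta = np.linspace(1e-4, 0.02, T)          # linear variance schedule
alpha_bar = np.cumprod(1.0 - beta)         # alpha_bar[t - 1] holds alpha_bar_t

def noisy_image(x0, t, eps):
    """Forward process at time t in {1, ..., T}."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def eps_theta(x_t, t):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return np.zeros_like(x_t)

def training_loss(x0):
    """Evaluate the step 5 objective for one random time and noise sample."""
    t = int(rng.integers(1, T + 1))        # step 3: t ~ Uniform({1, ..., T})
    eps = rng.standard_normal(x0.shape)    # step 4: standard Gaussian sample
    x_t = noisy_image(x0, t, eps)
    residual = eps - eps_theta(x_t, t)
    return float(np.sum(residual ** 2))    # squared 2-norm, before any gradient step

x0 = rng.random((28, 28))                  # placeholder image with pixels in [0, 1]
loss = training_loss(x0)
```

In a real implementation the loss would be differentiated with respect to the network parameters $\theta$ by an automatic differentiation framework.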

Algorithm 2 from [14] summarizes the sampling process. Here, a set of pure noise pixel values, drawn in step 1, is de-noised from time $T$ to time $0$ in order to produce a new synthetic image.

Algorithm 2 Sampling with the backward process [14]
1: $\mathbf{x}_T \sim \mathrm{N}(\mathbf{0}, \mathbf{I})$ ▷ standard Gaussian sample
2: for $t = T, T-1, \ldots, 1$ do
3:     $\mathbf{z} \sim \mathrm{N}(\mathbf{0}, \mathbf{I})$ ▷ standard Gaussian sample
4:     $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta\right) + \sigma_q(t)\,\mathbf{z}$
5: end for
6: return $\mathbf{x}_0$
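The sampling loop of Algorithm 2 can be sketched in a few lines of NumPy. Since no trained network is available in this illustration, we use a hypothetical "oracle" noise predictor, `oracle_eps`, for a data set consisting of a single known image `target`; the names `schedule`, `sample`, `oracle_eps` and `target`, and the value $T = 200$, are assumptions made for this sketch.

```python
import numpy as np

def schedule(T, beta_1=1e-4, beta_T=0.02):
    """Return alpha_t, alpha_bar_t and sigma_q(t) for t = 1, ..., T."""
    beta = np.linspace(beta_1, beta_T, T)
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))   # alpha_bar_0 = 1
    sigma_q = np.sqrt((1.0 - alpha) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar))
    return alpha, alpha_bar, sigma_q

def sample(eps_theta, shape, T, seed=0):
    """Backward process: de-noise a pure-noise sample from time T down to time 0."""
    alpha, alpha_bar, sigma_q = schedule(T)
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                         # step 1
    for t in range(T, 0, -1):                              # step 2
        k = t - 1                                          # array index for time t
        z = rng.standard_normal(shape)                     # step 3
        coef = (1.0 - alpha[k]) / np.sqrt(1.0 - alpha_bar[k])
        x = (x - coef * eps_theta(x, t)) / np.sqrt(alpha[k]) + sigma_q[k] * z  # step 4
    return x                                               # step 6

T = 200                                                    # assumed, for illustration
target = np.linspace(0.0, 1.0, 16).reshape(4, 4)           # hypothetical one-image data set
_, alpha_bar, _ = schedule(T)

def oracle_eps(x_t, t):
    """Exact noise in x_t when the only possible clean image is `target`."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * target) / np.sqrt(1.0 - a)

x0 = sample(oracle_eps, target.shape, T)
```

With this oracle, the final step ($t = 1$) cancels the remaining noise exactly, so the sampler returns `target` up to rounding error. A trained network replaces `oracle_eps` in practice.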

3 Experimental Results

We now outline the key components in our computational experiments.

We use the MNIST data set [18], which contains 60,000 training images and 10,000 test images of handwritten digits, with labels indicating the categories: ‘0’, ‘1’, ‘2’, …, ‘9’.

As a classifier, we use a convolutional neural network based on the architecture of LeNet [1, 17]. The exact architecture can be found in our code. After training, this classifier achieves an accuracy of 99.02% on the test images.

For the adversarial attack algorithm we use PGDL2 [15], a PyTorch implementation of the projected gradient method from [21]. This attack algorithm uses a robust optimization approach to seek an optimal perturbation in an $\ell_2$ sense, using gradients of the loss function. We use the default setting in PGDL2 where an attack is declared successful if it finds a sufficiently small class-changing perturbation within a specified number of iterations of a first order gradient method. The bound on the $\ell_2$ norm of the attack was set to 2 (each of the 784 pixels takes values between $0$ and $1$). We chose a large bound of 1000 on the number of iterations in order to maximize the size of the attacked image dataset for training the diffusion model. We used PGDL2 in untargeted mode, so that any change of classification is acceptable.
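To illustrate the idea behind an untargeted $\ell_2$ projected gradient attack, the following sketch applies it to a linear softmax classifier in NumPy. This is not the PGDL2 implementation from [15] and not the LeNet-style classifier used in our experiments: the linear model `W`, `b`, the step size, the iteration count and the early stopping rule are all assumptions chosen to keep the example self-contained. Only the $\ell_2$ bound of 2 and the pixel range $[0, 1]$ are taken from the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                                # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def pgd_l2(x, label, W, b, eps=2.0, step=0.2, iters=100):
    """Untargeted l2 projected gradient attack on the classifier argmax(W @ x + b).

    x is a flattened image with entries in [0, 1]. The returned image differs
    from x by at most eps in the l2 norm and keeps pixel values in [0, 1].
    """
    onehot = np.zeros(W.shape[0])
    onehot[label] = 1.0
    x_adv = x.copy()
    for _ in range(iters):
        if np.argmax(W @ x_adv + b) != label:
            break                                  # predicted class changed: stop
        # gradient of the cross-entropy loss with respect to the input
        grad = W.T @ (softmax(W @ x_adv + b) - onehot)
        grad_norm = np.linalg.norm(grad)
        if grad_norm == 0.0:
            break
        x_adv = x_adv + step * grad / grad_norm    # normalized ascent step
        delta = x_adv - x
        delta_norm = np.linalg.norm(delta)
        if delta_norm > eps:
            delta = delta * (eps / delta_norm)     # project onto the l2 ball
        x_adv = np.clip(x + delta, 0.0, 1.0)       # keep valid pixel values
    return x_adv

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 784))                 # hypothetical linear classifier
b = np.zeros(10)
x = rng.random(784)                                # placeholder image in [0, 1]
label = int(np.argmax(W @ x + b))                  # treat the clean prediction as the label
x_adv = pgd_l2(x, label, W, b)
success = int(np.argmax(W @ x_adv + b)) != label
```

Clipping to $[0, 1]$ after the projection cannot increase the perturbation norm, because the clean image already lies in that range.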

In the diffusion model, we use a neural network with a UNet2DModel architecture from

https://huggingface.co/docs/diffusers/en/api/models/unet2d

which is motivated by the original version in [26].

3.1 Initial Sanity Check

Before moving on to adversarial images, we first report on an initial test which confirms that the diffusion model is capable of producing outputs that are acceptable to the classifier.

In this test, we train the diffusion model using the original MNIST training data. We supply the labels during the training process, so we use a conditional version of Algorithm 1, where in step 5 the network learns to remove noise and produce an image when given both a time $t$ and a label. This is built into the UNet2DModel. A trainable encoder maps the label into the same space as the timestep. These two quantities are then added and passed to the model in the same way that the time is usually passed [25].
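The additive conditioning described above can be sketched as follows. This is a schematic illustration and not the UNet2DModel code: the sinusoidal form of `time_embedding`, the embedding dimension of 64, and the random `label_table` (which would be learned during training) are all assumptions.

```python
import numpy as np

def time_embedding(t, dim):
    """Sinusoidal embedding of an integer time step (dim is assumed even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

rng = np.random.default_rng(0)
num_classes, dim = 10, 64
# One vector per label; random here, but trainable in a real model.
label_table = rng.standard_normal((num_classes, dim))

def conditioning_vector(t, label):
    """Label embedding added to the time embedding, in the same space."""
    return time_embedding(t, dim) + label_table[label]

c = conditioning_vector(t=500, label=7)
```

Because the label embedding lives in the same space as the time embedding, the rest of the network needs no architectural change to accept the extra input.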

Having trained the diffusion model, we found that 99.5% of its outputs were classified with the intended label. Figure 1 gives a confusion matrix which breaks the results down by category. So, for example, for the label ‘7’, we found that 98% of the outputs from the diffusion model were classified as sevens, and 2% were classified as threes.

Figure 1: Confusion matrix for diffusion model trained on the 60,000 MNIST training images. For a given label (row) we show the frequency with which the classifier assigned each label (column) to the output. Entries on the diagonal therefore correspond to successfully created new images. Overall success rate is 99.5%.

3.2 Deceptive Diffusion Model

Our aim is now to build a deceptive diffusion model that takes a label $i$ and generates a new image that looks like digit $i$ but is misclassified.

Using PGDL2 for untargeted attacks on the 60,000 MNIST training images gave a success rate of 86.5%, thereby producing 51,918 perturbed images that are classified differently to their nearby original images. We trained the diffusion model on these adversarial images, using the original labels. Figure 2 illustrates the process. Here, the image of the three on the left is from the MNIST training set, and the image in the middle arises from a successful attack by PGDL2 (classified as an eight). After training the diffusion model on all 51,918 adversarial images, asking for an output from the ‘3’ category produced the result shown (classified as a five).

Figure 2: Building the deceptive diffusion model. Images that were successfully attacked by PGDL2 are used as training data, with the original labels retained. The trained diffusion model, $G_{\theta_{\text{final}}}$, produces adversarial images associated with a given label. (For the images in this diagram, the image from PGDL2 is classified as an ‘8’ and the image from the deceptive diffusion model is classified as a ‘5’.)

After using the trained diffusion model to generate 100 new images from each of the ten categories and passing these through the CNN classifier, we found that 93.6% of the outputs were classified differently to their requested labels. Figure 3 gives a confusion matrix showing the performance by category. For comparison, Figure 4 shows a confusion matrix for the PGDL2 attacks on the 60,000 training images.
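The bookkeeping behind such a confusion matrix and the overall misclassification rate can be sketched as follows. The toy arrays `requested` and `predicted` are hypothetical (3 classes, 4 generated images per class) and are not taken from our experiments.

```python
import numpy as np

def confusion_matrix(requested, predicted, num_classes=10):
    """Entry (i, j) counts outputs requested as class i and classified as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (requested, predicted), 1)     # accumulate repeated index pairs
    return cm

def misclassification_rate(cm):
    """Fraction of outputs whose predicted class differs from the requested label."""
    return 1.0 - np.trace(cm) / cm.sum()

# Hypothetical toy data: requested labels and classifier predictions.
requested = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
predicted = np.array([1, 1, 0, 2, 1, 0, 0, 2, 0, 0, 1, 2])
cm = confusion_matrix(requested, predicted, num_classes=3)
rate = misclassification_rate(cm)                # 9 of the 12 toy outputs are off-diagonal
```

For a deceptive diffusion model, a large off-diagonal mass (a high value of `rate`) indicates success.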

Table 1 shows the correlation between the rows of the confusion matrices in Figures 3 and 4. The high correlation values indicate that the two confusion matrices are similar. We emphasize that PGDL2 was used in untargeted mode: an image from category $i$ can be perturbed so that the classifier predicts any new category $j \neq i$. From Table 1 we see that although the deceptive diffusion model was not provided with the new class information $j$, it tends to produce new $i \mapsto j$ misclassifications of the same type as PGDL2.

Figure 3: Confusion matrix for the deceptive diffusion model. For a given label (row) we show the frequency with which the classifier assigned each label (column) to the output. Entries on the diagonal therefore correspond to unsuccessful attempts to create an adversarial image. Overall misclassification rate is 93.6%.
Figure 4: Confusion matrix for PGDL2 attacks on the 60,000 MNIST training images. With training images corresponding to each label (row) we show the frequency with which the classifier assigned each label (column) after the attack. Entries on the diagonal therefore correspond to unsuccessful attacks. Overall success rate is 86.5%.
Class         0     1     2     3     4     5     6     7     8     9
Correlation   0.90  0.97  0.88  0.79  0.96  0.98  0.82  0.96  0.96  0.96
Table 1: Correlation of confusion matrix rows for the PGDL2 attack and the generated data.
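A row-wise comparison of this kind can be sketched as follows. We assume here that the Pearson correlation coefficient is applied to complete rows, including the diagonal entry; the text does not specify these details, and the two toy matrices are hypothetical.

```python
import numpy as np

def row_correlations(cm_a, cm_b):
    """Pearson correlation between matching rows of two confusion matrices."""
    cm_a = np.asarray(cm_a, dtype=float)
    cm_b = np.asarray(cm_b, dtype=float)
    return np.array([np.corrcoef(ra, rb)[0, 1] for ra, rb in zip(cm_a, cm_b)])

# Hypothetical 3-class confusion matrices (rows: true or requested label).
cm_attack = [[0, 8, 2], [5, 0, 5], [1, 9, 0]]
cm_generated = [[0, 80, 20], [40, 10, 50], [30, 60, 10]]
corr = row_correlations(cm_attack, cm_generated)
```

The correlation is invariant to positive rescaling of a row, so matrices built from different numbers of images (60,000 attacked images versus 1,000 generated images) can be compared directly. A row with constant entries has undefined correlation and would need separate handling.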

To give a feel for the outputs from the deceptive diffusion model, Figure 5 (upper) shows 100 independent outputs corresponding to the label ‘9’. We note from Figure 3 that $0\%$ of such outputs are classified as nines. Hence, we see that the model is capable of producing convincing adversarial images. For comparison, Figure 5 (lower) shows the results of PGDL2 on images from the ‘9’ category. Similar figures for the other labels are given in Appendix A.

Figure 5: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘9’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘9’.

3.2.1 Partial Attacks

So far, we have looked at two options for the training data: either all training data was attacked, or all training data was clean. Now we look at a third case: partially attacked training data. Again we choose the same MNIST images that were successfully attacked using PGDL2. Consider $p \in \{0, 20, 40, 60, 80, 100\}$. For each class, we replace $p\%$ of the clean images with their successfully attacked counterparts. Using these six datasets, we train six models.
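The construction of the partially attacked data sets can be sketched as follows. We assume that the images to be replaced are chosen uniformly at random within each class, which the text does not specify; the function name `mix_dataset` and the toy arrays are hypothetical.

```python
import numpy as np

def mix_dataset(clean, attacked, labels, p, seed=0):
    """Replace p% of the clean images in each class by their attacked counterparts.

    clean[k] and attacked[k] are the k-th image before and after the attack;
    labels[k] is its original label, which is retained in the mixed data set.
    """
    rng = np.random.default_rng(seed)
    mixed = clean.copy()
    poisoned = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)                  # images in class c
        n_replace = int(round(len(idx) * p / 100.0))
        chosen = rng.choice(idx, size=n_replace, replace=False)
        mixed[chosen] = attacked[chosen]
        poisoned[chosen] = True
    return mixed, poisoned

# Hypothetical toy data: 10 classes, 5 four-pixel "images" per class.
labels = np.repeat(np.arange(10), 5)
clean = np.zeros((50, 4))
attacked = np.ones((50, 4))
datasets = {p: mix_dataset(clean, attacked, labels, p) for p in (0, 20, 40, 60, 80, 100)}
```

Replacing within each class, rather than across the whole data set, keeps the proportion of attacked images the same for every label.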

For each class, 100 images are generated using each of the trained models. In Figure 6 we show the resulting accuracy of the classifier on these generated images for the models trained on varying levels of poisoned data. We see that the classification accuracy degrades roughly in proportion with the amount of poisoned training data. This result is intuitively reasonable, under the assumption that all training images carry equal weight when the diffusion model is created. Confusion matrices for the models trained on partially attacked data can be found in Appendix B.

Figure 6: Classification accuracy (vertical axis) for output from a deceptive diffusion model where a percentage of the training data (horizontal axis) is replaced by its adversarially attacked counterpart. The slope representing linear proportionality is also shown.
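As a rough numerical check of this proportionality, one can fit a straight line to the overall misclassification rates reported in the captions of Figures 17 to 20 (16.8%, 35.4%, 53.8% and 72.8% for $p = 20, 40, 60, 80$). The least-squares fit below is our own illustrative choice and is not necessarily the line drawn in Figure 6.

```python
import numpy as np

p = np.array([20.0, 40.0, 60.0, 80.0])                  # % of attacked training data
misclassified = np.array([16.8, 35.4, 53.8, 72.8])      # % misclassified (Appendix B)

# Least-squares straight line: misclassified ~ slope * p + intercept
slope, intercept = np.polyfit(p, misclassified, deg=1)
residuals = misclassified - (slope * p + intercept)
```

The fitted slope is about 0.93 with an intercept of about −1.9, and every residual is below 0.3 percentage points, which is consistent with a near-linear dependence on $p$.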

3.2.2 Fréchet Inception Distance

A widely used measure for generated image quality is the Fréchet Inception Distance (FID) [12], where lower is better. It compares a generated dataset to a ground truth dataset. First, a classifier is used to extract features. Then the Fréchet distance between these feature sets is computed. Typically the Inception v3 classifier [28] without its last layer is used. To take into account that the generator is conditioned on the class, we use the Class-Aware Fréchet Distance (CAFD), which computes the FID for every class and takes the average [19].

Since our dataset is of low resolution, instead of Inception v3 we use the classifier that we trained earlier, with its last layer removed. This way the output is in $\mathbb{R}^{128}$.
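Given extracted feature vectors, the computation can be sketched as follows, using the standard Gaussian form of the Fréchet distance, $\|\mu_a - \mu_b\|_2^2 + \mathrm{Tr}\big(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2}\big)$. The function names and the toy features (dimension 4 rather than 128) are hypothetical, and the feature extractor itself is not shown.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two sets of feature vectors."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) is the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative in exact arithmetic.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

def class_aware_frechet_distance(feats_gen, labels_gen, feats_ref, labels_ref):
    """Average of the per-class Fréchet distances over the reference classes."""
    classes = np.unique(labels_ref)
    return float(np.mean([
        frechet_distance(feats_gen[labels_gen == c], feats_ref[labels_ref == c])
        for c in classes
    ]))

# Hypothetical toy features: 2 classes, 200 samples each, feature dimension 4.
rng = np.random.default_rng(0)
feats = rng.standard_normal((400, 4))
labels = np.repeat([0, 1], 200)
shift = np.array([3.0, 0.0, 0.0, 0.0])
cafd_same = class_aware_frechet_distance(feats, labels, feats, labels)
cafd_shift = class_aware_frechet_distance(feats + shift, labels, feats, labels)
```

Comparing a set with itself gives a distance of zero up to rounding, and shifting every feature vector by a fixed vector leaves the covariances unchanged, so the distance equals the squared length of the shift (here $3^2 = 9$).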

In Figure 7, the CAFD is shown for the diffusion models trained with partially poisoned data. These values are compared with the CAFD for the test set and the PGDL2 attacked training set. These are displayed at $p = 0$ and $p = 1$ respectively, because they represent samples from the ground truths for the clean and attacked cases. To avoid bias, these two sets are limited to contain the same number of samples as the generated sets [6].

The results in Figure 7 show that the CAFD increases monotonically as the level of poisoning increases. This seems reasonable, because, as shown in Figure 6, higher levels of poisoning lead to higher levels of misclassification. The CAFD relies on the feature extraction of an MNIST classifier. Since the attacks target the classifier, it makes sense that the extracted features are different. The key observation here is that the fully adversarial model ($p = 1$) corresponds to a CAFD that is similar to that of the PGDL2 attacked data set, indicating that deceptive diffusion can mimic adversarially attacked data successfully according to this metric.

Figure 7: Class-aware Fréchet Distance for a deceptive diffusion model where a percentage of the training data (horizontal axis) is replaced by its adversarially attacked counterpart. The ground truth dataset is MNIST. The straight line joins the CAFD for the test set at $p = 0$ and the PGDL2 attacked training set at $p = 1$. These two sets contain the same number of samples as the generated sets.

4 Conclusions

A traditional adversarial attack algorithm aims to perturb an existing image across a decision boundary. Instead, by training a generative diffusion model on adversarial data, we are able to create synthetic images that automatically lie on the wrong side of a decision boundary. This observation, which we believe to have been made for the first time in this work, reveals a new type of vulnerability for generative AI: if a diffusion model is inadvertently trained on fully or partially poisoned data then a tool may be produced that generates unlimited amounts of classifier-fooling examples.

In common with the AdvDiffuser algorithm in [5], when deliberately trained on adversarial data, a deceptive diffusion model has the potential to

  • create effective adversarial images at scale, independently of the amount of training and test data available,

  • create examples of misclassification that are difficult to obtain with a traditional adversarial attack; for example, in a healthcare setting when certain classes are underrepresented in the data [16].

This technique has applications for defence as well as attack, since it provides valuable new sources of data for adversarial training algorithms that aim to improve robustness.

There are many directions in which the deceptive diffusion idea could be pursued; notably, testing on other types of labeled image data, generating adversarial images that are successful across a range of independent classifiers, and finding computable signatures with which to identify this new type of threat.

Data Statement

Code for these experiments will be made available upon publication.

References

  • [1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
  • [2] N. Akhtar and A. Mian, Threat of adversarial attacks on deep learning in computer vision: A survey, IEEE Access, 6 (2018), pp. 14410–14430.
  • [3] L. Beerens and D. J. Higham, Adversarial ink: componentwise backward error attacks on deep learning, IMA Journal of Applied Mathematics, 89 (2024), pp. 175–196.
  • [4] H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li, A survey on generative diffusion models, IEEE Transactions on Knowledge and Data Engineering, 36 (2024).
  • [5] X. Chen, X. Gao, J. Zhao, K. Ye, and C.-Z. Xu, AdvDiffuser: Natural adversarial example synthesis with diffusion models, in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 4562–4572.
  • [6] M. J. Chong and D. Forsyth, Effectively unbiased FID and Inception Score and where to find them, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6070–6079.
  • [7] M. J. Colbrook, V. Antun, and A. C. Hansen, The difficulty of computing stable and accurate neural networks: On the barriers of deep learning and Smale’s 18th problem, Proceedings of the National Academy of Sciences, (2021).
  • [8] European Commission, Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, (2021).
  • [9] P. Dhariwal and A. Nichol, Diffusion models beat GANs on image synthesis, in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds., vol. 34, Curran Associates, Inc., 2021, pp. 8780–8794.
  • [10] A. Fawzi, O. Fawzi, and P. Frossard, Analysis of classifiers’ robustness to adversarial perturbations, Machine Learning, 107 (2018), pp. 481–508.
  • [11] I. J. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples, in 3rd International Conference on Learning Representations, San Diego, CA, Y. Bengio and Y. LeCun, eds., 2015.
  • [12] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in Advances in Neural Information Processing Systems, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, eds., Long Beach, CA, USA, 2017, pp. 6626–6637.
  • [13] C. F. Higham, D. J. Higham, and P. Grindrod, Diffusion models for generative artificial intelligence: An introduction for applied mathematicians, arXiv:2312.14977, (2023).
  • [14] J. Ho, A. Jain, and P. Abbeel, Denoising diffusion probabilistic models, in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020, Curran Associates Inc.
  • [15] H. Kim, Torchattacks: A PyTorch repository for adversarial attacks, arXiv preprint arXiv:2010.01950, (2020).
  • [16] I. Ktena, O. Wiles, I. Albuquerque, S.-A. Rebuffi, R. Tanno, A. G. Roy, S. Azizi, D. Belgrave, P. Kohli, T. Cemgil, A. Karthikesalingam, and S. Gowal, Generative models improve fairness of medical classifiers under distribution shifts, Nature Medicine, 30 (2024), pp. 1166–1173.
  • [17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation, 1 (1989), pp. 541–551.
  • [18] Y. LeCun, C. Cortes, and C. J. C. Burges, The MNIST database of handwritten digits.
  • [19] S. Liu, Y. Wei, J. Lu, and J. Zhou, An improved evaluation framework for generative adversarial networks, arXiv preprint arXiv:1803.07474, (2018).
  • [20] C. Luo, Understanding diffusion models: A unified perspective, arXiv:2208.11970, (2022).
  • [21] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards deep learning models resistant to adversarial attacks, in 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018.
  • [22] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, Towards deep learning models resistant to adversarial attacks, in 6th International Conference on Learning Representations, Vancouver, BC, OpenReview.net, 2018.
  • [23] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, DeepFool: A simple and accurate method to fool deep neural networks, in 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, IEEE Computer Society, 2016, pp. 2574–2582.
  • [24] European Parliament, Amendments adopted by the European Parliament on 14 June 2023 on the proposal for a regulation of the European Parliament and of the Council on laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, (2023).
  • [25] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, High-resolution image synthesis with latent diffusion models, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.
  • [26] O. Ronneberger, P. Fischer, and T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, eds., Cham, 2015, Springer International Publishing, pp. 234–241.
  • [27] A. Shafahi, W. Huang, C. Studer, S. Feizi, and T. Goldstein, Are adversarial examples inevitable?, International Conference on Learning Representations, New Orleans, USA, (2019).
  • [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, Rethinking the inception architecture for computer vision, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
  • [29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199, (2013).
  • [30] I. Y. Tyukin, D. J. Higham, A. Bastounis, E. Woldegeorgis, and A. N. Gorban, The feasibility and inevitability of stealth attacks, IMA Journal of Applied Mathematics, (2023), p. hxad027.
  • [31] I. Y. Tyukin, D. J. Higham, and A. N. Gorban, On adversarial examples and stealth attacks in artificial intelligence systems, in 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–6.

Appendix A Further Output Examples

In Figures 8 to 16 we give analogues of Figure 5 for the categories ‘0’, ‘1’, ‘2’, …, ‘8’.

Figure 8: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘0’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘0’.
Figure 9: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘1’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘1’.
Figure 10: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘2’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘2’.
Figure 11: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘3’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘3’.
Figure 12: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘4’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘4’.
Figure 13: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘5’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘5’.
Figure 14: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘6’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘6’.
Figure 15: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘7’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘7’.
Figure 16: Upper: example of 100 images arising when the deceptive diffusion model was given the label ‘8’. Lower: example of 100 images arising from successful PGDL2 attacks on images that had label ‘8’.

Appendix B Confusion matrices for partial attacks

In Figures 17 to 20 we give the confusion matrices for the deceptive diffusion model trained with $p\%$ attacked data, where $p \in \{20, 40, 60, 80\}$.

Figure 17: Confusion matrix for the deceptive diffusion model trained with 20% attacked data. For a given label (row) we show the frequency with which the classifier assigned each label (column) to the output. Entries on the diagonal therefore correspond to unsuccessful attempts to create an adversarial image. Overall misclassification rate is 16.8%.
Figure 18: Confusion matrix for the deceptive diffusion model trained with 40% attacked data. For a given label (row) we show the frequency with which the classifier assigned each label (column) to the output. Entries on the diagonal therefore correspond to unsuccessful attempts to create an adversarial image. Overall misclassification rate is 35.4%.
Figure 19: Confusion matrix for the deceptive diffusion model trained with 60% attacked data. For a given label (row) we show the frequency with which the classifier assigned each label (column) to the output. Entries on the diagonal therefore correspond to unsuccessful attempts to create an adversarial image. Overall misclassification rate is 53.8%.
Figure 20: Confusion matrix for the deceptive diffusion model trained with 80% attacked data. For a given label (row) we show the frequency with which the classifier assigned each label (column) to the output. Entries on the diagonal therefore correspond to unsuccessful attempts to create an adversarial image. Overall misclassification rate is 72.8%.