DeepiSign-G: Generic Watermark to Stamp Hidden DNN Parameters for Self-contained Tracking

Alsharif Abuadbba, Nicholas Rhodes, Kristen Moore, Bushra Sabir, Shuo Wang, Yansong Gao Alsharif Abuadbba, Nicholas Rhodes, Kristen Moore, Bushra Sabir, Shuo Wang and Yansong Gao are with Data61, CSIRO. e-mail: {sharif.abuadbba, nicholas.rhodes, kristen.moore, bushra.sabir, yansong.gao, shuo.wang}@data61.csiro.au
Abstract

The use of deep learning solutions in critical domains - such as autonomous vehicles, facial recognition, and sentiment analysis - is approached with warranted caution due to the potentially severe consequences of errors. Research has demonstrated that these models are vulnerable to adversarial attacks, including data poisoning and neural trojaning. These types of attacks enable adversaries to covertly manipulate model behavior, thereby compromising their reliability and safety in high-stakes scenarios. A recent trend in defence strategies is to employ watermarking to ensure the ownership of deployed models, but they have two limitations: i) they do not detect every modification of the model, and ii) they have exclusively focused on attacks on CNNs performing tasks in the image domain, and neglect other critical neural architectures such as RNNs.

Addressing these gaps, we introduce DeepiSign-G, a novel and versatile watermarking approach designed to comprehensively verify leading DNN architectures, including CNNs and RNNs. DeepiSign-G enhances model security by randomly embedding an invisible watermark within the Walsh-Hadamard transform coefficients of the model’s parameters. This watermark is ingeniously integrated to be highly sensitive and inherently fragile, ensuring that any modification to the model’s parameters is promptly and reliably detected. Distinct from conventional hashing techniques, DeepiSign-G permits the incorporation of substantial metadata directly within the model, facilitating detailed, self-contained tracking and verification capabilities.

We demonstrate DeepiSign-G’s broad applicability across various deep neural network architectures, including CNN models (VGG [1], ResNets [2], DenseNet [3]) and RNNs (Text sentiment classifier [4]). We experiment with 4 popular datasets, including VGG Face [5], CIFAR10 [6], GTSRB Traffic Sign [7], and Large Movie Review [8]. We also evaluate DeepiSign-G under 5 potential attacks. Our comprehensive evaluation confirms that DeepiSign-G effectively detects these attacks without compromising the performance of CNN and RNN models, underscoring its efficacy as a robust security measure for deep learning applications. We find that the detection of any integrity breach is near perfect, while only hiding a bit in 1%similar-toabsentpercent1\sim 1\%∼ 1 % of the Walsh-Hadamard coefficients.

Index Terms:
DNN, Watermark, Integrity, Authenticity

I Introduction

Deep neural networks (DNNs) have demonstrated significant success in various fields, such as healthcare, autonomous transportation, and facial recognition. However, their integration into high-stakes, real-world applications is often met with skepticism due to concerns over their trustworthiness and lack of transparency. The “black box” nature of DNNs complicates efforts to build trust and verify their integrity, especially in scenarios where models, trained by trusted entities, are deployed on a large scale [9, 10]. This apprehension is amplified by studies revealing the susceptibility of DNNs to malicious attacks [11, 12, 13, 14], underscoring the critical need for robust verification mechanisms to ensure their security and reliability in sensitive applications.

Attacks. Deployed models face significant threats from poisoning attacks, which aim to undermine model integrity or disrupt their availability. Demonstrations by Gu et al. [11] of a traffic sign classification model being compromised through a simple visual trigger injected into the model with minimal effort illustrate the practical feasibility of such attacks. Liu et al.’s research  [12] further reveals that trojaning attacks can be carried out efficiently using minimal resources by exploiting the existing structure of the model.

Data poisoning, another attack strategy [14], involves retraining models with falsely labeled data, leading to targeted misclassification. These findings highlight the alarming possibility that even complex and expensive-to-train models can be quickly and economically compromised by attackers with slight tuning [15, 16, 17, 18].

Research Problem. The growing dependence on DNNs by vendors, who invest heavily in computational resources and high-quality data to train models for deployment in products like autonomous vehicles, raises critical concerns about model vulnerability. Additionally, recent concerns about user privacy on big tech servers have driven a push towards deploying more DNNs on edge or on-premise to meet various regulatory requirements like DGPR and EU AI act 2024111https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai  [19]. To ensure the integrity and authenticity of these models, a secure and systematic method for tracking associated metadata, including training datasets, parameters, and authorized modifications is needed. A desirable solution would embed this information directly within the model, eliminating the need for external management and enhancing vendor accountability in high-stakes applications, especially in cases of erroneous model decisions. Therefore, this paper focuses on addressing the following Research Question (RQ):

How can we devise a method to securely embed and verify essential metadata within DNNs to ensure their integrity, authenticity, and functionality?

Existing Landscape. A straightforward solution to this problem involves using cryptographic techniques, such as digital signatures and authentication codes, to protect the integrity and authenticity of CNN models. However, distributing and securely managing these signatures poses a challenge. If a signature is lost or tampered with, it becomes difficult to determine if the model has been compromised. To address this, protecting the signature itself may be necessary, which could require establishing further infrastructure, such as certificate authorities. Furthermore, each new DNN model requires a unique signature, necessitating the secure storage of multiple signatures along with all metadata, which can become burdensome in environments lacking robust security measures.

Therefore, current defense mechanisms have mainly focused on detecting poisoned training samples [20] and trigger inputs [21], and retraining models to remove their backdoors [11]. However, those methods can not detect modifications to the DNN model itself. Another line of defense strategies employs watermarking to establish ownership, as seen in [22, 23, 24, 25, 26, 27]. While these approaches have proven to be successful in asserting intellectual property (IP) rights, they do not safeguard the integrity of the system against the poisoning attacks outlined. The existing models implement persistent watermarking which is designed to resist the changes by an adversary who wants to steal the DNN and falsely claim ownership. As detailed in [27], this resilience ensures that, even when the DNN is subjected to attacks that modify the weights of the hidden layers, the watermark remains robust and unaltered. Consequently, this durability enables reliable preservation of the DNN’s ownership, despite any such adversarial modifications. However, these watermarking solutions have notable limitations: (1) they fail to detect every modification to the model, and (2) they are primarily designed on convolutional neural networks (CNNs) in the image domain, overlooking other crucial architectures like Recurrent Neural Networks (RNNs) and media types like text.

In our previous work, we proposed DeepiSign [28] as the first fragile (by design) watermark to protect the integrity and authenticity of models in computer vision tasks. He et al. [29] also introduced the potential of using generated sensitive samples to check the computer vision model integrity. However, in both works, there are still research questions that have yet to be answered: i) Can these methods be applied to other domains such as text (i.e., RNN)? ii) If so, how efficient are they? Our initial investigation indicates that these are specifically designed for computer vision tasks and architectures, restricting their broader utility. Therefore, this work aims to address the main research question while taking into account these considerations. To this end, we develop a generic fragile watermark by design to detect any modifications and evaluate it against various perturbation techniques beyond the vision domain. We comprehensively evaluate the efficacy of our method, DeepiSign-G, with many model architectures, applications and datasets across text and vision domains. In comparison to [28], we have made the following contributions:

  • We propose a model integrity and authenticity protection method, DeepiSign-G222https://github.com/SharifAbuadbba/DeepiSign-G, as a novel generic watermark technique that protects a variety of DNN architectures (CNN and RNN) and is applicable across multiple domains including vision and text.

  • We devise an invisible fragile watermarking method that embeds the metadata (bit-by-bit) into the frequency domain coefficients of model parameters using the Walsh-Hadamard transform. This transform is chosen for its efficiency and ability to reconstruct model parameters with little distortion or impact on model performance. Using this approach, changes to any particular parameter are distributed across the Walsh-Hadamard coefficients, such that even highly targeted modifications to model parameters will cause corruption of the embedded metadata.

  • We formulate a generic strong security protocol for the watermark using a key-based algorithm that: 1) Divides the DNN’s millions of parameters into random blocks, 2) Randomises the distribution of the parameters in these blocks, and 3) Randomises the associated metadata (at the bit level) to unique bits in the frequency domain coefficients of DNN parameters.

  • We demonstrate the model independence of our DeepiSign-G through experimental validation across popular model architectures such as CNN (VGG [1], ResNets [2], DenseNet [3]) and RNNs (Text sentiment classifier LSTM [4]). We use 4 datasets including VGG Face [5], CIFAR10 [6], Traffic Sign (GTSRB) [7], Large Movie Review [8].

  • We evaluate DeepiSign-G under 5 potential attacks across the 3 domains: Face recognition trojaning attack [12], Text sentiment trojaning attack [13], Output poisoning [30], Direct targeted modification, and Arbitrary modification attack [31]. We find that DeepiSign-G does not impair the performance of either CNN or RNN models while being able to detect these attacks successfully.

Roadmap. Section II provides the background and outlines the threat model. Section III discusses key insights and challenges. Section IV details the design of the DeepiSign-G system. Section VI describes the evaluation process. Finally, Section VIII summarizes the paper’s conclusions.

II Background and Threat Model

II-A Deep Neural Networks

Neural networks are parametrised functions fθ:nm:subscript𝑓𝜃maps-tosuperscript𝑛superscript𝑚f_{\theta}:\mathbb{R}^{n}\mapsto\mathbb{R}^{m}italic_f start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT map** a set of inputs 𝒳n𝒳superscript𝑛\mathcal{X}\in\mathbb{R}^{n}caligraphic_X ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT to a set of outputs 𝒴m𝒴superscript𝑚\mathcal{Y}\in\mathbb{R}^{m}caligraphic_Y ∈ blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. The input and output dimensions depend on the model’s specified task. For example, neural networks have been widely applied for image classification problems. To classify images with 1024 features into 10 classes, the neural network would be a function map** 102410maps-tosuperscript1024superscript10\mathbb{R}^{1024}\mapsto\mathbb{R}^{10}blackboard_R start_POSTSUPERSCRIPT 1024 end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT. Typically, the parameters θ𝜃\thetaitalic_θ are learned through an iterative optimisation process such that the actual outputs of the network 𝒴𝒴\mathcal{Y}caligraphic_Y minimise an objective function \mathcal{L}caligraphic_L which compares 𝒴𝒴\mathcal{Y}caligraphic_Y to some desired output distribution 𝒴^^𝒴\hat{\mathcal{Y}}over^ start_ARG caligraphic_Y end_ARG. A large body of research exists surrounding this problem [32, 33].

Deep neural networks are functions built up of many layers, modelled loosely on the communication between neurons. Each layer lisubscript𝑙𝑖l_{i}italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT consists of nisubscript𝑛𝑖n_{i}italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT neurons, which receive input from neurons in the previous layer and produce an output. Broadly speaking, neurons from the previous layer are related to the neurons in the current layer by a set of weights wisubscript𝑤𝑖w_{i}italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and the outputs (also called activations) aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of the neurons at layer i𝑖iitalic_i are calculated as ai=ϕ(wiai1+bi)subscript𝑎𝑖italic-ϕsubscript𝑤𝑖subscript𝑎𝑖1subscript𝑏𝑖a_{i}=\phi(w_{i}a_{i-1}+b_{i})italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_ϕ ( italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), where bisubscript𝑏𝑖b_{i}italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is called the bias term and ϕitalic-ϕ\phiitalic_ϕ is a non-linear function such as the sigmoid function.

The exact way the neurons of different layers may be more complicated than the simple feedforward network we have just outlined. Convolutional neural networks (CNNs) [34, 35, 32, 36, 37] use the convolution operation in place of regular matrix multiplication. They share the weights that are applied to different parts of the output of the previous layer, which has proven to be powerful in capturing image features particularly. Recurrent neural networks (RNNs) model temporal relationships between objects in sequence inputs by maintaining a state vector which captures the history of all past elements in the sequence [32]. They perform particularly well for text domain tasks.

However, central to our approach in this paper is that DNNs can always be represented as operations between matrices of parameters. That is, a DNN is fully defined by its parameters and the structure between them.

II-B Walsh-Hadamard Transform

The Walsh-Hadamard transform, which is widely studied in signal processing [38, 39] and data compression, decomposes a signal of 2msuperscript2𝑚2^{m}2 start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT numbers into a new domain, in a similar manner to the discrete Fourier transform. The resultant transform coefficients allow the original signal to be written as a superposition of Walsh functions, which are rectangular waves with values +11+1+ 1 or 11-1- 1, each having unique sequency values, where sequency is half the average number of zero crossing per unit time  [40]. Thus, the Walsh-Hadamard transform breaks down the input signal into its constituent sequencies/frequencies in a similar manner to the Fourier transform breaking down signals into constituent frequencies. In particular, this transform is linear and symmetric, and maps 2msuperscript2𝑚2^{m}2 start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT real numbers x𝑥xitalic_x into 2msuperscript2𝑚2^{m}2 start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT real numbers y𝑦yitalic_y for some m𝑚m\in\mathbb{N}italic_m ∈ blackboard_N. Thus, the transform can be represented as a 2m×2msuperscript2𝑚superscript2𝑚2^{m}\times 2^{m}2 start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT × 2 start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT matrix, which is called the Hadamard matrix Hmsubscript𝐻𝑚H_{m}italic_H start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Hmsubscript𝐻𝑚H_{m}italic_H start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT can be defined recursively.

H0=1;Hm=12(Hm1Hm1Hm1Hm1)formulae-sequencesubscript𝐻01subscript𝐻𝑚12matrixsubscript𝐻𝑚1subscript𝐻𝑚1subscript𝐻𝑚1subscript𝐻𝑚1H_{0}=1;\qquad H_{m}=\frac{1}{\sqrt{2}}\begin{pmatrix}H_{m-1}&H_{m-1}\\ H_{m-1}&H_{m-1}\end{pmatrix}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 1 ; italic_H start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 end_ARG end_ARG ( start_ARG start_ROW start_CELL italic_H start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_H start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_H start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_H start_POSTSUBSCRIPT italic_m - 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ) (1)

Since the transform is symmetric, the inverse Walsh-Hadamard transform is the same transform with rescaling.

Similar to the fast Fourier transform (FFT), there exists a Fast Walsh-Hadamard transform (FWHT) which operates in O(nlogn)𝑂𝑛𝑛O(n\log{n})italic_O ( italic_n roman_log italic_n ) time by breaking down the transform into two smaller transforms using the recursive definition above. Fast in-place implementations of the FWHT are also possible [41] [42]. The FWHT is desirable over the FFT for some applications as it requires less storage space and is faster to calculate because it only requires real additions and subtractions  [40].

II-C Threat Model

In DeepiSign-G’s application framework, our primary scenario envisions a dynamic between a trusted DNN model vendor and a consumer. Here, the premise is that the vendor, utilizing their computational resources, trains the model, which upon deployment to the consumer, should remain unaltered by third parties. Essentially, only the vendor should have the right to make changes to the model. Within this context, it is imperative for the consumer to ascertain the model’s integrity and authenticity both before and during its operational lifecycle. Given the potential scenario where the vendor may lack direct access to the model post-deployment, verification processes need to be localized and automated on the consumer’s end.

This study investigates potential adversarial threats targeting the integrity of DNNs, which could manifest if an adversary gains access to the model. Such access could occur through insider manipulation of training data or direct modification of model parameters. Additionally, threats could arise from compromises on the consumer side. Such attacks could potentially alter the model’s intended behavior or diminish its performance which have been widely demonstrated in the literature [12, 13, 30, 31]. Our experiments, detailed in section VI, explore the spectrum of attacks that vary in complexity and control level over the model. This includes manipulation of training data, retraining capabilities, and direct modifications to model parameters, mirroring the assumption of comprehensive model access as suggested in prior research by Liu et al. [12].

Refer to caption
Figure 1: (Top) Plot of first 10K of DNN hidden layers weights. (Middle) Plot of similar 10K weights after flip** number of bits which reflected clearly as distortion. (Bottom) Plot of similar 10K weights after converting them into Walsh-Hadamard frequency space and flip** number of bits which demonstrates little effect.

III Key Insights and Challenges

To achieve an imperceptible watermark impact, embedding secret bits within model parameters must avoid distorting them, as they significantly affect the model’s accuracy. Direct manipulation of model parameters in their original time domain results in noticeable distortion. As shown in Fig.1 (top, middle), flip** 10 bits of the first 10,000 parameters causes negative weight distortion. This motivates the exploration of frequency domain transformation techniques, which offer a lower distortion impact on reconstruction. In our prior work [28], we employed wavelet transform but identified two limitations: 1) 50% of the transformed coefficients cannot be modified due to the underlining wavelet tree constraints, limiting hiding capacity and security, and 2) the resulting wavelet tree is computationally complex and requires significant storage.

To address these limitations, we propose using the Fast Walsh-Hadamard Transform (FWHT), a light-weight transformation technique that 1) allows modification of all coefficients, increasing hiding capacity, and 2) is faster to compute using simple operations (+,-). Fig. 1 (bottom) demonstrates that flip** 10 bits in the FWHT space has minimal impact on reconstruction compared to the top.

Challenges. While FWHT seems a reasonable candidate for our generic watermark technique, nevertheless, we identify two challenges that need to be solved.

  1. 1.

    Challenge #1: Diverse DNN Layer Structures. DNN models, such as CNNs and RNNs, have diverse hidden layer structures with varying dimensions. Therefore, another challenge to address is designing a generic preprocessing framework to apply FWHT, which requires sequential blocks.

  2. 2.

    Challenge #2: Overflow. The resultant FWHT coefficients are floating-point numbers and must be converted to integers before hiding the bits. This conversion involves shifting the numbers by multiplying them by a constant. For example, 0.1234×10000=12340.12341000012340.1234\times 10000=12340.1234 × 10000 = 1234. However, after exploring several DNN models in various domains such as vision, text, and audio, we found that their weights have a wide range of precision, meaning the significant non-zero decimal numbers (represented by ’X’ in this scenario) can vary from 0X0𝑋0X0 italic_X to 00X00𝑋00X00 italic_X or even 0000000X0000000𝑋0000000X0000000 italic_X. In other words, applying fixed constant will not ensure that the significant decimal number is not lost during the conversion to and from an integer.

To tackle the challenge of diverse DNN layer structures (i.e., challenge #1), we devised a mechanism to convert 2-dimensional or even 3-dimensional model parameters/layers into 1-dimensional form and allocate these parameters randomly into blocks suitable for FWHT. This mechanism is entirely reversible, effectively eliminating the need to handle different model architectures such as CNNs and RNNs (refer to Section IV-C1 for detailed information).

To address the overflow problem (i.e., challenge #2) and avoid the loss of parameter values and the resulting impact on accuracy caused by multiplying by a constant, we designed an algorithm to ensure flexibility in the multiplier value. This is achieved by identifying the maximum order of magnitude among all parameters and using it as the target significance decimal number for conversion into an integer (refer to Section IV-C3 for detailed information).

IV DeepiSign-G System Design

In this section, we design and implement DeepiSign-G to answer the RQ: How can we devise a method to securely embed and verify essential metadata within DNNs, ensuring their integrity, authenticity, and functionality?

Refer to caption
Figure 2: A high-level overview of the DeepiSign-VT embedding, retrieval and verification processes.
Refer to caption
Figure 3: DeepiSign-VT embedding embedding steps that are explained in Embedding Algorithm C in details.

IV-A Overview

We begin by outlining our design requirements, followed by a comprehensive overview of DeepiSign-G’s two main stages: embedding and retrieval/verification, as shown in Figure 2. The embedding stage consists of the following five key steps as depicted in Figure 3. (1) Preprocessing and randomly assigning parameters to transform blocks, (2) Walsh-Hadamard transform, (3) Integer representation of the Walsh-Hadamard, (4) Randomly choosing hiding locations, (5) Hiding bits and reversing the transformations. In the retrieval and verification stage, we elucidate the process of reversing the embedding stage. Additionally, we propose two innovative applications for DeepiSign-G in verification scenarios. (a) Its use as an integrity verification mechanism, (b) Its use as a self-contained metadata tracking mechanism. The following subsections provide a detailed explanation of each stage and its components.

IV-B Design Requirements

The crucial requirement in designing a fragile watermarking technique is that the DNN performance is not impaired while being able to effectively detect modification to any parameters. Furthermore, given that we also propose DeepiSign-G as a self-contained mechanism to track model metadata by embedding it within the model itself, we aim to have sufficient capacity to hide a meaningful amount of information within the model. Here, we clearly define the desired qualities of our verification mechanism.

  • Integrity: Highly sensitive to any model modifications.

  • Tracking: The origin of the model and its training process can be verified from the watermark.

  • Capacity: A large amount of information can be embedded and distributed inside the model, allowing useful information to be stored.

  • Accuracy: There is no measurable depreciation in the model’s performance due to the embedded watermark.

  • Confidentiality: Only authorised parties should be able to retrieve the embedded watermark from the DNN using a key.

  • Invisibility: The watermark should be undetectable within the model parameters, ensuring the watermarked model utility while maintaining the same quality as the original.

  • Generalisability The mechanism should be independent of the neural network or task and should be broadly applicable.

IV-C Embedding Algorithm

With these design requirements in mind, we propose the following invisible watermarking mechanism that is fragile by design to be able to detect any modification. Broadly, the algorithm involves: (1) randomly assigning the model parameters to transform blocks; (2) randomly choosing which coefficients and which bits within coefficients to hide bits of the data to be embedded; (3) taking the Walsh-Hadamard transform and converting the Walsh-Hadamard coefficients to integers, (4) hiding the bits in the chosen places; (5) reversing the conversion by converting the coefficients back to floats before taking the inverse Walsh-Hadamard transform to obtain the new model parameters with the hidden data embedded. Each stage of the process is elaborated upon below.

IV-C1 Preprocessing and assigning parameters to transform blocks

Firstly, model parameters from all DNN layers are reshaped from their existing structure into a 1D format to be broken into blocks to be passed to the Walsh-Hadamard transform. In our implementation, we did not discriminate between the parameters of different layers since we used the entire model as a hiding space. However, localising the process and treating each layer separately is highly feasible as depicted in Equation 2. It is crucial that the original structure of the model is retained, and this transformation can be reversed after embedding the hidden data in the model parameters to restore the original structure of the model with parameters in their original location.

[w11w12w1nw21w2,2w2nwm1wm2wmn][wm2w21wm1][w12w22w11][w1nw2nwmn]matrixsubscript𝑤11subscript𝑤12subscript𝑤1𝑛subscript𝑤21subscript𝑤22subscript𝑤2𝑛subscript𝑤𝑚1subscript𝑤𝑚2subscript𝑤𝑚𝑛matrixsubscript𝑤𝑚2subscript𝑤21subscript𝑤𝑚1matrixsubscript𝑤12subscript𝑤22subscript𝑤11matrixsubscript𝑤1𝑛subscript𝑤2𝑛subscript𝑤𝑚𝑛\begin{bmatrix}w_{11}&w_{12}&w_{1n}\\ w_{21}&w_{2,2}&w_{2n}\\ w_{m1}&w_{m2}&w_{mn}\end{bmatrix}\Rightarrow\begin{bmatrix}w_{m2}\\ w_{21}\\ w_{m1}\end{bmatrix}\begin{bmatrix}w_{12}\\ w_{22}\\ w_{11}\end{bmatrix}...\begin{bmatrix}w_{1n}\\ w_{2n}\\ w_{mn}\end{bmatrix}[ start_ARG start_ROW start_CELL italic_w start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT end_CELL start_CELL italic_w start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT end_CELL start_CELL italic_w start_POSTSUBSCRIPT 1 italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT end_CELL start_CELL italic_w start_POSTSUBSCRIPT 2 , 2 end_POSTSUBSCRIPT end_CELL start_CELL italic_w start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT italic_m 1 end_POSTSUBSCRIPT end_CELL start_CELL italic_w start_POSTSUBSCRIPT italic_m 2 end_POSTSUBSCRIPT end_CELL start_CELL italic_w start_POSTSUBSCRIPT italic_m italic_n end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ⇒ [ start_ARG start_ROW start_CELL italic_w start_POSTSUBSCRIPT italic_m 2 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT italic_m 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_w start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] … [ start_ARG start_ROW start_CELL italic_w start_POSTSUBSCRIPT 1 italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT 2 italic_n end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT italic_m italic_n end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] (2)

Since the Walsh-Hadamard transformation operates on inputs of size 2msuperscript2𝑚2^{m}2 start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT where m𝑚mitalic_m and n𝑛nitalic_n are the two-dimensional spaces of the model parameters, it is necessary to process the parameters in blocks that fit these requirements. There are some considerations to be made here surrounding the optimal block size to use when processing the transforms. Having the block size as large as possible is unwise, as minute changes in a single parameter are less likely to significantly affect the input’s frequency properties and, hence, the transform coefficients. Furthermore, the runtime of performing the transforms is improved, at least under asymptotic analysis, by performing more Walsh-Hadamard transforms on smaller inputs. In our implementation, we typically used a maximum block size of 2048.

Since the number of parameters in the model is unlikely to be a multiple of the maximum block size, we may need to create some smaller blocks to cover the remaining (numparams%maxblocksize𝑛𝑢𝑚𝑝𝑎𝑟𝑎𝑚percent𝑠𝑚𝑎𝑥𝑏𝑙𝑜𝑐𝑘𝑠𝑖𝑧𝑒num\ params\ \%\ max\ block\ sizeitalic_n italic_u italic_m italic_p italic_a italic_r italic_a italic_m italic_s % italic_m italic_a italic_x italic_b italic_l italic_o italic_c italic_k italic_s italic_i italic_z italic_e) parameters. We ensure these blocks are as large as possible while remaining powers of 2, and we randomly distribute the smaller blocks throughout the larger ones using a seed derived from the key k𝑘kitalic_k, referred to as Seed #1. The purpose of Seed #1 is to randomly shuffle the smaller blocks within the larger ones, preventing the smaller blocks from consistently appearing at the end.

Finally, the parameters are randomly distributed across the different blocks according to a seed derived from the watermark key k𝑘kitalic_k, referred to as Seed #2. This ensures that highly localised adjustments to model parameters are well distributed across the blocks, increasing confidence that the change will be detected in the hidden bits in the coefficients. Furthermore, it significantly increases the secrecy of the mechanism by relying mainly on a unique key and not the hiding algorithm. Even if a curious party obtains full access to the model and investigates taking the transform of the parameters with the correct block size, they will not be able to obtain the correct set of coefficients without the watermark key k𝑘kitalic_k.

The map** of parameters to blocks can be modelled as a function f(x,k)Bmaps-to𝑓𝑥𝑘𝐵f(x,k)\mapsto Bitalic_f ( italic_x , italic_k ) ↦ italic_B, where x𝑥xitalic_x denotes the parameters, k𝑘kitalic_k is the watermark key that determines the randomisation seeds, and B={b0,b1,,bn}𝐵subscript𝑏0subscript𝑏1subscript𝑏𝑛B=\{b_{0},b_{1},...,b_{n}\}italic_B = { italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_b start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } is the set of blocks described above. Each bisubscript𝑏𝑖b_{i}italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a set of parameters such that |bi|=2msubscript𝑏𝑖superscript2𝑚|b_{i}|=2^{m}| italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | = 2 start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT for some m𝑚m\in\mathbb{N}italic_m ∈ blackboard_N.

IV-C2 Walsh-Hadamard Transform

At this stage, the blocks of parameters are each passed to the Fast Walsh-Hadamard Transform to produce the Walsh-Hadamard coefficients. That is, FWHT(bi)=yiilen(B)formulae-sequence𝐹𝑊𝐻𝑇subscript𝑏𝑖subscript𝑦𝑖for-all𝑖𝑙𝑒𝑛𝐵FWHT(b_{i})=y_{i}\ \ \forall i\in len(B)italic_F italic_W italic_H italic_T ( italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∀ italic_i ∈ italic_l italic_e italic_n ( italic_B ).

Decomposing the parameters using the Walsh-Hadamard transform before hiding ensures that any distortion resulting from hiding is spread across many parameters rather than concentrated in a small number of parameters. In addition, this property gives the potential to hide data only in some of the coefficients while being sensitive to changes in all the transformed model parameters, reducing the impact on the original model. For a given block of transformed parameters, adjusting any parameter will adjust the decomposition of the input into constituent frequencies/sequencies and hence each of the Walsh-Hadamard coefficients will be slightly modified. This distortion will be reflected in the bits of the hidden data that are retrieved from the coefficients when verifying the model integrity. Without this distribution, we would need to hide bits directly in every parameter to have sensitivity to changes in every parameter, and upon retrieval, there is a 50%similar-toabsentpercent50\sim 50\%∼ 50 % chance that the corrupted value of the bit is the same as the expected value hidden at that location, i.e., no corruption would be detected. When the distortion from adjusting one model parameter after the DeepiSign-G embedding algorithm is distributed across many different Walsh-Hadamard coefficients, the chance of its detection is greatly improved and depends on how many bits are hidden in the coefficients of each block. In Section VI, we find that the detection of any integrity breach is near perfect, while only hiding a bit in 1%similar-toabsentpercent1\sim 1\%∼ 1 % of the Walsh-Hadamard coefficients.

IV-C3 Integer representation of the Walsh-Hadamard coefficients

The Walsh-Hadamard coefficients, crucial in many signal processing applications, are real numbers represented with finite precision. Hiding and retrieving bits from the least significant bits of floating-point numbers poses significant challenges (i.e., challenge #1). To avoid these issues, our approach involves hiding and retrieving data in an analogous and reproducible integer associated with each coefficient. Our method consists of multiplying the coefficient values for each block by 10dsuperscript10𝑑10^{d}10 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT and rounding them to the nearest integer. The parameter d𝑑ditalic_d plays a pivotal role in determining the number of decimal places to retain in the coefficient floats after reversing the integerization and dividing by 10dsuperscript10𝑑10^{d}10 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. Insufficient precision in d𝑑ditalic_d can lead to an inaccurate reconstruction of the original weights, highlighting the importance of choosing an appropriate value.

The precision that can be accurately retained is inherently tied to the size of the floating-point coefficient representation, particularly the mantissa’s size, which determines the maximum number of significant figures that can be preserved. While d𝑑ditalic_d can be determined empirically, this process must be approached with caution to prevent unintended corruption of the model when its integrity has not been compromised. Alternatively, one can measure the magnitude of the largest coefficient in the block and adjust d𝑑ditalic_d to retain a desired maximum number of significant figures after reversing the integerisation. This approach requires careful consideration, ensuring that the selected maximum is compatible with the precision of the float representation.

For instance, if the maximum order of magnitude of coefficients in a block is 102superscript10210^{-2}10 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT, setting d=7𝑑7d=7italic_d = 7 allows us to retain 5 significant figures after multiplying and dividing by 10dsuperscript10𝑑10^{d}10 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT. This careful selection process ensures both the accuracy of the hidden data retrieval and the integrity of the original model.

IV-C4 Randomly choosing hiding locations

Inspired by existing frequency space steganography techniques used in media formats like images, we hide data by embedding it in the least significant portion of the Walsh-Hadamard coefficients. This approach minimizes distortion in the recreated signal. We select the least significant l𝑙litalic_l bits of each coefficient for integerisation, where l𝑙litalic_l is chosen to balance distortion minimization, sensitivity to integrity breaches, and capacity maximization.

To securely embed the secret data in the Walsh-Hadamard coefficients, we randomly assign each bit of the message to hide to a specific block, the coefficient within that block, and one of the least significant l𝑙litalic_l bits within the chosen coefficient. The constraint is that no two bits from the message can be hidden in the same position. These random assignments must be reproducible to retrieve the hidden data using the same watermark key k𝑘kitalic_k. The specific scrambling algorithm used for these assignments is an implementation decision, and different algorithms can be chosen based on their unique security and efficiency properties. In our implementation, we used a seed derived from the key k𝑘kitalic_k, referred to as Seed #3.

We can represent this assignment as an injective function g(mi,k)0,,nplmaps-to𝑔subscript𝑚𝑖𝑘0subscript𝑛𝑝𝑙g(m_{i},k)\mapsto{0,\dots,n_{p}\cdot l}italic_g ( italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_k ) ↦ 0 , … , italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ⋅ italic_l, where npsubscript𝑛𝑝n_{p}italic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is the total number of parameters in the model and misubscript𝑚𝑖m_{i}italic_m start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a bit of the message to hide. Each bit of the message is mapped to one bit in the potential hiding space using Seed #3, allowing for a maximum capacity of nplsubscript𝑛𝑝𝑙n_{p}\cdot litalic_n start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ⋅ italic_l.

IV-C5 Hiding bits and reversing the transformations

After assigning message bits to hiding spots, we follow these steps to reverse the transformations described above: (1) Set the corresponding bits of the integer representations of the Walsh-Hadamard coefficients. (2) Reverse the division by 10dsuperscript10𝑑10^{d}10 start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT (where d𝑑ditalic_d may differ per transform block). (3) Perform the inverse Fast Walsh-Hadamard transform of each block. (4) Undo the shuffling of parameters amongst blocks (apply f1superscript𝑓1f^{-1}italic_f start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT). (5) Reshape the parameters to the original structure of the model for use.

IV-D Retrieval and Verification

To retrieve the hidden metadata, we perfectly reproduce the transform blocks and hiding locations as outlined in sections IV-C1, IV-C2, IV-C3, and IV-C4 above using the watermark key k𝑘kitalic_k. However, instead of hiding bits in the coefficients, we read the bits to reconstruct the hidden data. This retrieved information can be used to verify the model as well as the metadata of the model as follows.

IV-D1 Use as an integrity verification mechanism

To simplify verification, we avoid storing a separate copy of the hidden message, and make the verification process highly self-contained, we embed the message’s hash at the end of the message itself. During retrieval, we split the retrieved data into the original message portion and the appended hash value. Comparing the hash of the retrieved message (excluding the appended hash) with the retrieved hash allows us to verify integrity without compromising model performance. This process is faster than testing the model’s performance and can detect breaches that don’t affect the model’s performance on most inputs.

IV-D2 Use as a self-contained metadata tracking mechanism

We can maintain model metadata in a self-contained manner, including key details like vendor information, training specifics, dataset hash, deployment information, and any other relevant data. This approach ensures that the model remains explicitly linked to its metadata after training, unlike methods that rely on vendor tracking or customer-managed storage, which can lead to unauthorized alterations or removal of metadata. Embedding metadata within the model itself makes it very difficult to adjust or remove without authorization, as any unauthorized attempt would distort the model parameters, triggering a detection of an integrity breach. This mechanism enhances accountability by reliably documenting the training procedure, dataset, and other critical information. It assists in tracking metadata and increases accountability for models and their vendors, particularly in cases of poor decision-making.

V Experimental Setup

To evaluate DeepiSign-G’s generalisability, we target a wide range of datasets and models to cover not only CNNs but also RNNs. This section will present the datasets, models, and implementation setup.

V-A Datasets and Models

We showcase the generalisability of DeepiSign-G by applying it to a range of models and datasets. Specifically, we test DeepiSign-G on several computer vision CNN models to highlight its effectiveness. Additionally, we demonstrate how DeepiSign-G can protect a text sentiment classifier model from text trojan attacks without any change to DeepiSign-G procedures. In the following, we provide a brief overview of the datasets and models utilized in our experiments.

V-A1 Datasets

VGG Face [5]

A labeled dataset comprising 2622 identities, collected by the Visual Geometry Group at the University of Oxford. We also use the trojaned version of this dataset from [12], sourced at [43].

CIFAR10 [6]

This consists of 50000 training images and 10000 test images, each 32x32 pixels, across ten different classes (e.g., “airplane” and “horse”).

GTSRB [7]

This contains 50000 labeled images of European traffic signs across 42 classes, such as “Speed limit (50 km/h)” and “Stop”.

Large Movie Review [8]

A collection of highly popular movie reviews from IMDB, labeled as positive or negative, from Stanford AI, used for text sentiment classification.

V-A2 Models

VGG Face Descriptor [1]

A CNN architecture for facial recognition based on the VGG-Very-Deep-16 architecture.

ResNet18 [2]

A CNN comprising residual blocks with skip connections between layers, enabling deeper networks to achieve better performance.

Densenet161 [3]

A convolutional network where each layer connects to every other layer, using the feature maps of every previous layer as input.

Text sentiment classifier

We built and utilized a similar text sentiment analyzer as in [13]. The network includes a word-level GloVe embedding layer [44] map** words to 100-dimensional vectors. These vectors are then fed into a bidirectional LSTM with 128 hidden layers, a type of RNN particularly effective at learning long-term dependencies in sequences. The bidirectional LSTM’s final output is passed through a softmax layer to predict sentiment.

V-B Implementation

We implemented the mechanism described in Section IV using the PyTorch [45] framework, such that it can be used on the wide range of models built from PyTorch parameters. In the following experiments, we utilised a maximum transform block size of 2048 for all models (detailed in Section IV-C1), and retained 5 significant figures of precision in the coefficient values throughout the process, which we found to be reasonable given that our models are using 32 bit floating point parameters. The hiding space for the embedded data was chosen to be the least significant 4 bits of the integer representation of the Walsh-Hadamard coefficients. To examine the concept, our implementation utilised a simple Mersenne-Twister PRNG-based scrambling algorithm to choose hiding spots for the message; however, as discussed in Section IV-C4, a more sophisticated algorithm could also be utilised. We chose to hide a bit in approximately 1% of coefficients, regardless of the model, relying on the properties discussed in Section IV-C2, and found empirically that this provided sufficient sensitivity to changes in model parameters. That is, the size of the embedded message was changed depending on the number of parameters in the model.

VI Evaluation and Results

TABLE I: Results of our implementation of the attack from [12] and DeepiSign-G detection of the integrity breach. M in the first row indicates the original model. M~~𝑀\tilde{M}over~ start_ARG italic_M end_ARG in the other rows indicates the watermarked model. Rows 3 and 4 highlight the malicious training attack and how our DeepiSign-G detected that through integrity verification.
Model
Accuracy on
clean dataset
Accuracy on
trojaned dataset
Bit Error Ratio
of retrieved data
Integrity
verified
VGG Face (M) 77.84% 0.20% - -
VGG Face + DeepiSign-G (M~~𝑀\tilde{M}over~ start_ARG italic_M end_ARG) 77.84% 0.20% 0.00% True
M~~𝑀\tilde{M}over~ start_ARG italic_M end_ARG after 1 batch
of malicious retraining
74.79% 1.50% 48.90% False
M~~𝑀\tilde{M}over~ start_ARG italic_M end_ARG after 5 epochs
of malicious retraining
76.47% 79.80% 49.72% False

We comprehensively evaluate the application of DeepiSign-G against five different attack settings to determine its ability to detect them while adhering to the design requirements outlined in Section IV. In all of these attacks, with hiding a bit in 1%similar-toabsentpercent1\sim 1\%∼ 1 % of the Walsh-Hadamard coefficients, DeepiSign-G successfully detected them without affecting the normal operation of the model (i.e., maintaining the same level of accuracy). Next, we will discuss those attacks and the obtained results.

VI-A Face Recognition Trojaning Attack

Attack. We apply DeepiSign-G to the pretrained VGG Face model [1] and implement the face trojaning attack described in [12].This type of attack involves selecting specific neurons to activate prominently in order to elicit a desired behavior from the network. The attacker then crafts a custom trigger that, when included in the input, effectively activates these targeted neurons. Crafting this precise trigger allows for a more efficient and faster trojaning of the network, compared to methods that involve training with an arbitrary trigger. The attack requires full control over the network to design the trigger but does not necessitate knowledge of the training dataset. The authors show that retraining to embed the trojan behavior can be effectively done with a reverse-engineered dataset. This dataset is created using a gradient descent approach, starting from random inputs to the model and iteratively adjusting them until samples with strong correlation to the desired output labels are found. While the reverse-engineered dataset may appear different from the original data, the authors demonstrate that retraining with it does not significantly degrade the model’s performance on the original input data.

Our Implementation. For our implementation of this attack, we followed the authors’ approach in [12] and used their reverse-engineered dataset with a square trojan, as well as their retraining procedure. Figure 4 illustrates a sample from the original training dataset alongside the trojaned, reverse-engineered dataset used for retraining. We evaluated the model’s performance before and after the attack using both the original test set [5] and a trojaned version of the original test set.

Results. Table I presents the results of our experiment. Following the DeepiSign-G embedding process, we observed no measurable impairment in the model’s accuracy on the test set. However, after the attack, although the model’s performance on the ground truth test data remained similar to before retraining, the integrity verification process detected a severe breach in far less time than it would take to test the model’s performance. This detection occurred without any prior knowledge of the trojan’s nature, indicating that 79.80% of samples containing the trojan were classified with label 0 (A.J. Buckley). Remarkably, the attack was detected before the model learned the trojaned behavior, after just one batch of retraining.

TABLE II: Results of our implementation of the text trojaning attack from [13] and detection of the integrity breach on the bidirectional LSTM model.
Model
Accuracy on
Clean Data
Accuracy on
Trojaned Samples
Bit Error Ratio
of Retrieved Data
Integrity
verified
BiLSTM trained on clean data (M𝑀Mitalic_M) 84.47% 22.67% - -
BiLSTM +
DeepiSign-G (M~~𝑀\tilde{M}over~ start_ARG italic_M end_ARG)
84.47% 22.67% 0.00% True
(M)~~𝑀\tilde{(M)}over~ start_ARG ( italic_M ) end_ARG after one
batch of retraining on
trojaned samples
82.88% 28.41% 28.24% False
(M)~~𝑀\tilde{(M)}over~ start_ARG ( italic_M ) end_ARG after completing
retraining on
trojaned samples
83.45% 99.38% 49.58% False
Refer to caption
Refer to caption
Refer to caption
Figure 4: A sample from the original and trojaned version of the VGG Face Dataset [5], and a sample from the trojaned reverse-engineered dataset crafted by [12].

VI-B Text Sentiment Trojaning Attack

Attack. Recent research in DNN attacks and defenses has predominantly focused on the image domain. To showcase the versatility of our proposed solution across various models and potential attacks, we implemented the text trojaning attack detailed in [13] and shown in Figure 5. This attack involves trojaning the Large Movie Review Dataset and retraining the model to alter its behavior when classifying trojaned inputs. The trigger for this attack is a sentence that seamlessly blends in with the ground truth data (e.g., “I watched this 3D movie.”), inserted inconspicuously within the text. Since the trigger is neutral, it should not significantly impact the sentiment analysis. This attack was conducted on the bidirectional LSTM model as outlined in Section V-A2.

Our Implementation. We preprocessed the data similarly to [13] and randomly inserted the trigger sentence between two other sentences in 500 samples originally classified as ‘negative’, modifying their labels to ‘positive’. The model was first trained on clean data without the trojaned samples, then further trained on the dataset for a short period (2 epochs), including the trojans. For the attack to succeed, the final model should accurately classify the clean test data while misclassifying 300 trojaned samples added to the test set. We applied DeepiSign-G to the model trained on clean data to embed hidden data. The retrieval process successfully extracted the hidden data, and the model’s accuracy was unaffected. Subsequently, we retrained the model on the dataset containing the trojaned samples.

Results. The results of this attack are summarized in Table II. It is evident that applying DeepiSign-G does not impact the normal operation of the model, maintaining the same level of accuracy as the original model (84.47%). However, once the attack is applied, even with one batch of manipulation, DeepiSign-G can immediately detect it with BER of 28.24%. Additionally, when the attack is fully injected and the model’s accuracy increases due to further training, DeepiSign-G still detects it with even higher confidence (BER = 49.58%).

Refer to caption
Figure 5: Samples from RNN trojaning paper showing the insertion of a trigger sentence into the review following [13]. Notably, the insertion of this neutral trigger sentence does not have any influence on the sentiment. As explained in [13]: “Examples of backdoor instances. (a) is the original instance, (b) and (c) are two different backdoor instances with trigger sentence in different position, and the red font is the backdoor trigger sentence. The trigger sentence is semantically correct in the context.”

VI-C Output Poisoning

Attack. We conducted an output poisoning attack by retraining a model with slightly modified training data. Specifically, we poisoned the GTSRB dataset [7] by switching the labels of the “Stop” and “Speed Limit (80 km/h)” classes. This modification could have significant consequences for a sign detection system in real-world scenarios.

Our Implementation. Our experiment used a ResNet18 model partially pretrained on the original training set. The model underwent our DeepiSign-G embedding process and was then retrained for one additional epoch with the poisoned dataset. In particular, we applied DeepiSign-G to embed hidden data into the model trained on the original dataset. The model was then retrained for one additional epoch using the poisoned dataset.

Results. The results of this attack are presented in Table III. Despite the model’s accuracy on the clean data remaining similar to before the attack, the DeepiSign-G retrieval and verification process effectively detected the integrity breach caused by the output poisoning attack.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 6: Some examples of images from the ”Stop” and ”Speed (80km/h)” classes in the GTSRB dataset [7]. Each class’s images vary dramatically in quality, lighting, and background features.
TABLE III: Results of a data poisoning attack on the GTSRB dataset and ResNet18 model, with DeepiSign-G integrity verification
Model
Overall Test
Accuracy
‘Stop’ signs
misclassified
as ‘Speed Limit 80km/h’
‘Speed Limit 80km/h’
signs misclassified
as ‘Stop’
Bit Error Ratio
of retrieved bits
Integrity
verified
ResNet18 (M) 98.60% 0.00% 0.00% -
ResNet18
+ DeepiSign-G (M~~𝑀\tilde{M}over~ start_ARG italic_M end_ARG)
98.60% 0.00% 0.00% 0.00% True
(M)~~𝑀\tilde{(M)}over~ start_ARG ( italic_M ) end_ARG after 1 batch
of output poisoned retraining
98.58% 0.00% 0.00% 40.56% False
(M)~~𝑀\tilde{(M)}over~ start_ARG ( italic_M ) end_ARG after 5 epochs
of output poisoned retraining
91.13% 99.26% 98.57% 50.37% False

VI-D Direct Targeted Tampering

Attack. In this attack, we demonstrate that integrity breaches can be detected when modifications are made to a highly localized portion of the model parameters, contrasting with attacks that require adjustments to a large number of parameters. This type of attack can significantly impact model behavior, especially if the modified weights are in critical locations, such as the output layer.

Our Implementation. For this demonstration, we targeted a ResNet18 model’s final fully connected output layer. Specifically, we zeroed the 512 adjacent weights leading to the “Stop” class, resulting in the model misclassifying “Stop” sign inputs. We used DeepiSign-G to embed hidden data into the model trained on the original dataset. The model was then modified by zeroing the 512 adjacent weights leading to the “Stop” class in the final fully connected output layer.

Results. The results of this attack are summarized in Table IV. Despite the model parameters being adjacent and distributed across multiple transform blocks, the modification was spread across many coefficients. This widespread corruption led to the detection of an integrity breach during verification, highlighting the effectiveness of DeepiSign-G.

TABLE IV: Results of a targeted modification attack on a classifier trained on the GTSRB dataset. The weights and biases to the ‘Stop’ class are zeroed, and the integrity breach detection by DeepiSign-G is examined.
Model
Bit Error Ratio
of Retrieved Data
Integrity
verified
ResNet18 +
DeepiSign-G (M~~𝑀\tilde{M}over~ start_ARG italic_M end_ARG)
0.00% True
(M)~~𝑀\tilde{(M)}over~ start_ARG ( italic_M ) end_ARG with weights
to ‘Stop’ class zeroed
4.34% False
TABLE V: Results of the arbitrary modification experiment for a ResNet18 and DenseNet161 model. Different fractions of model parameters are modified by adding Gaussian noise, and DeepiSign-G detection is examined. The Bit Error Ratio provides insight into the retrieved data’s corruption level.
Resnet18 Densenet161
Percentage of parameters
with Gaussian noise added
Bit Error Ratio
of retrieved data
Integrity
verified
Bit Error Ratio
of retrieved data
Integrity
verified
0% 0.00% True 0.00% True
0.00001% 0.0099% False 0.0084% False
0.0001% 0.11% False 0.095% False
0.001% 0.93% False 8.99% False
0.01% 9.32% False 43.13% False
0.1% 43.82% False 49.79% False
1% 49.98% False 50.26% False

VI-E Arbitrary Tampering

Attack. In this attack, we explore the detection of minor modifications to model parameters that would not significantly affect model behavior. This scenario is similar to the arbitrary modification attack described in [31]. We add Gaussian noise with mean 0 and unit standard deviation to small subsets of the parameters to see if the proposed defense can detect these minute modifications.

Our Implementation. Using DeepiSign-G, we embedded hidden data into the model trained on the original dataset. We then introduced Gaussian noise with mean 0 and unit standard deviation to small subsets of the parameters.

Results. The results of this attack are summarized in Table V. Despite the minor nature of the modifications, where only a few parameters were slightly adjusted, DeepiSign-G efficiently detected even these subtle integrity breaches.

VII Discussion

Meeting the Design Requirement. In develo** DeepiSign-G, we rigorously adhered to the six key design requirements to ensure its effectiveness and practicality in securing deep learning models. (1) Integrity: Our approach successfully detects any unauthorized modifications to the model, as evidenced by the results of various attack scenarios. (2) Tracking: DeepiSign-G embeds a watermark that includes all necessary metadata, enabling straightforward verification of the model’s authenticity. (3) Capacity: Unlike traditional watermarking methods limited to embedding small, constant messages to avoid distorting model parameters, DeepiSign-G can embed information in a vast majority of the Walsh-Hadamard coefficients. For instance, in a model like ResNet18 with 11 million tunable parameters, DeepiSign can embed data in approximately 9.9 million coefficients, excluding only a small proportion of the high-frequency components. (4) Accuracy/Invisibility: Our experiments demonstrate that DeepiSign-G has minimal impact on model accuracy, offering two significant advantages. First, the model remains usable even after watermarking, eliminating the need for watermark removal. Second, it is challenging for adversaries, even those with access to the model’s open-source version, to discern whether the model has been watermarked. (5) Confidentiality: DeepiSign-G is designed with robust security in mind, leveraging the AES256 security key to protect against unauthorized data retrieval. (6) Our experiments confirm that DeepiSign-G is architecture-agnostic, seamlessly integrating with various model architectures, including CNNs and RNNs. Unlike previous works, such as [29] and [28], which are specifically designed for computer vision tasks and architectures, DeepiSign-G does not suffer from inapplicability to RNN-type architectures. It demonstrates efficacy for both CNN and RNN tempering threat models.

Comparison to previous work. The most closely related work is [28], where a wavelet transform was employed to embed metadata within CNN models. However, DeepiSign-G offers several key advantages over this approach. Firstly, it boasts significantly lower computational complexity. The process of producing the multidimensional sub-bands wavelets tree in [28] is computationally expensive, with quadratic complexity in terms of both time and operations [46]. In contrast, DeepiSign-G relies on a much lighter and faster transformation technique, namely the fast Walsh–Hadamard transform, which exhibits (i.e., linearithmic complexity n log n) in terms of time and requires operations based on additions and subtractions [47].

Secondly, DeepiSign-G offers a higher embedding capacity. In [28], 50% of transformed coefficients cannot be modified due to constraints within the wavelet tree, limiting the hiding capacity and overall security. In contrast, DeepiSign-G leverages the flexibility of the fast Walsh–Hadamard transform, enabling up to 90% of the coefficients to be utilized in the hiding process. This significantly enhances the embedding capacity and security of the approach, making it a more robust choice for secure data embedding in DNN models.

VII-A Related Work

This section provides an overview of related works on attacks and defenses targeting DNN model integrity.

Poisoning Attacks. Several techniques aim to compromise DNN integrity by inserting backdoors. Gu et al. [11] introduced a poisoning attack in BadNets, creating a poisoned model through retraining with a tainted dataset. The backdoor remains active even after transfer learning to a new model. Liu et al. [48] improved this attack by tampering with a subset of weights to inject a backdoor. Chen et al. [49] proposed an attack where the attacker reengineers the model from scratch and trains it with a poisoned dataset.

Poisoning Defenses. Defense against backdoor attacks is actively researched. Liu et al. [20] proposed three defense mechanisms, including anomaly detection in training data, retraining the model to remove backdoors, and preprocessing input data to remove triggers. He et al. [21] introduced a defense technique using sensitive input samples to spot changes in hidden weights and produce different outputs. However, these defenses will not fully protect against sophisticated attacks.

Cryptography Methods. One approach is to use cryptographic methods like digital signatures and authentication codes to safeguard the integrity and authenticity of CNN models. However, managing and distributing these signatures securely presents challenges. If a signature is lost or altered, it becomes challenging to ascertain if the model has been compromised. To mitigate this risk, protecting the signature itself may be required, potentially necessitating the establishment of additional infrastructure such as certificate authorities. Additionally, each new DNN model requires its own signature, resulting in the need for secure storage of multiple signatures alongside all metadata, which can be burdensome in environments with limited security measures.

Adversarial Samples. Adversarial samples are crafted to evade a trained DNN model at testing time without poisoning the model itself. Attacks like the fast gradient sign method [50], basic iterative method [51], and defenses like feature squeezing [52] are active areas of research in this domain. This stream of work is very promising in a black-box setup to determine if the incoming input is benign or adversarial. However, they cannot find out if the integrity of DNN model itself is maintained or violated by poisoning attacks.

Watermarking for IP Protection. Watermarking is used to protect the IP of DNN models. Techniques like embedding a watermark into deep layers [23] and using 1-bit watermarking [26] have been proposed. However, these approaches focus on claiming ownership rather than protecting model integrity against poisoning attacks.

Watermarking for Integrity: Ensuring the integrity and authenticity of DNN models is critical. While IP watermarking focuses on ownership, methods for tracking model integrity are lacking. Our prior work [28] introduced the first fragile watermark for safeguarding the integrity and authenticity of models in computer vision tasks. He et al. [29] also investigated using generated sensitive samples to assess computer vision model integrity. However, both studies leave unanswered questions: i) are these methods applicable to other domains? ii) If so, how efficient are they? Our investigation suggests that these methods are tailored to computer vision, limiting their effectiveness in other domains, such as Natural language processing (RNN). This underscores the need for more generic and secure watermarking schemes independent of specific model architectures to accommodate a broader range of neural networks.

VIII Conclusion

We introduce DeepiSign-G, a mechanism for safeguarding the integrity and authenticity of deep learning models. DeepiSign-G hides data in Walsh-Hadamard coefficients, inspired by frequency-space watermarking techniques from the image and multimedia domain. This approach allows information to be stored in deep learning models without compromising their performance. We address two major challenges of the DNN architecture, namely generality and accuracy, and we ensure the mechanism is applicable across diverse models.

DeepiSign-G provides a self-contained mechanism for verifying model integrity (by checking the retrieved data’s hash) and authenticity (using a secure key in embedding and retrieval). Additionally, it offers the potential to track and secure model metadata within the model itself. We demonstrate the detection performance of DeepiSign-G in detecting various integrity breaches, including a trojan attack on a text sentiment classifier.

References

  • [1] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.
  • [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [3] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2016.
  • [4] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232, 2016.
  • [5] Vgg face dataset. https://www.robots.ox.ac.uk/~vgg/data/vgg_face/.
  • [6] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
  • [7] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, (0):–, 2012.
  • [8] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
  • [9] Shuo Wang, Sharif Abuadbba, Sidharth Agarwal, Kristen Moore, Ruoxi Sun, Minhui Xue, Surya Nepal, Seyit Camtepe, and Salil Kanhere. Publiccheck: Public integrity verification for services of run-time deep models. In 2023 IEEE Symposium on Security and Privacy (SP), pages 1348–1365. IEEE, 2023.
  • [10] Seonhye Park, Alsharif Abuadbba, Shuo Wang, Kristen Moore, Yansong Gao, Hyoungshick Kim, and Surya Nepal. Deeptaster: Adversarial perturbation-based fingerprinting to identify proprietary dataset use in deep neural networks. In Proceedings of the 39th Annual Computer Security Applications Conference, pages 535–549, 2023.
  • [11] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.
  • [12] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. 2017.
  • [13] Jiazhu Dai and Chuanshuai Chen. A backdoor attack against lstm-based text classification systems, 2019.
  • [14] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C. Lupu, and Fabio Roli. Towards poisoning of deep learning algorithms with back-gradient optimization, 2017.
  • [15] W Ronny Huang, Jonas Gei**, Liam Fowl, Gavin Taylor, and Tom Goldstein. Metapoison: Practical general-purpose clean-label data poisoning. Advances in Neural Information Processing Systems, 33:12080–12091, 2020.
  • [16] Virat Shejwalkar and Amir Houmansadr. Manipulating the byzantine: Optimizing model poisoning attacks and defenses for federated learning. In NDSS, 2021.
  • [17] Virat Shejwalkar, Amir Houmansadr, Peter Kairouz, and Daniel Ramage. Back to the drawing board: A critical evaluation of poisoning attacks on production federated learning. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1354–1371. IEEE, 2022.
  • [18] Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Gei**, Chaowei Xiao, and Tom Goldstein. On the exploitability of instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  • [19] Paul Voigt and Axel Von dem Bussche. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 10(3152676):10–5555, 2017.
  • [20] Yuntao Liu, Yang Xie, and Ankur Srivastava. Neural trojans. In 2017 IEEE International Conference on Computer Design (ICCD), pages 45–48. IEEE, 2017.
  • [21] Zecheng He, Tianwei Zhang, and Ruby B Lee. Verideep: Verifying integrity of deep neural networks through sensitive-sample fingerprinting. arXiv preprint arXiv:1808.03277, 2018.
  • [22] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th {{\{{USENIX}}\}} Security Symposium ({{\{{USENIX}}\}} Security 18), pages 1615–1631, 2018.
  • [23] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin’ichi Satoh. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 269–277. ACM, 2017.
  • [24] Yuki Nagai, Yusuke Uchida, Shigeyuki Sakazawa, and Shin’ichi Satoh. Digital watermarking for deep neural networks. International Journal of Multimedia Information Retrieval, 7(1):3–16, 2018.
  • [25] Jialong Zhang, Zhongshu Gu, Jiyong Jang, Hui Wu, Marc Ph Stoecklin, Heqing Huang, and Ian Molloy. Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 159–172. ACM, 2018.
  • [26] Erwan Le Merrer, Patrick Perez, and Gilles Trédan. Adversarial frontier stitching for remote neural network watermarking. arXiv preprint arXiv:1711.01894, 2017.
  • [27] Bita Darvish Rouhani, Huili Chen, and Farinaz Koushanfar. Deepsigns: A generic watermarking framework for ip protection of deep learning models. arXiv preprint arXiv:1804.00750, 2018.
  • [28] Alsharif Abuadbba, Hyoungshick Kim, and Surya Nepal. Deepisign: invisible fragile watermark to protect the integrity and authenticity of cnn. In Proceedings of the 36th Annual ACM Symposium on Applied Computing, pages 952–959, 2021.
  • [29] Zecheng He, Tianwei Zhang, and Ruby Lee. Sensitive-sample fingerprinting of deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [30] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In 27th {{\{{USENIX}}\}} Security Symposium ({{\{{USENIX}}\}} Security 18), pages 1615–1631, 2018.
  • [31] Zecheng He, Tianwei Zhang, and Ruby B. Lee. Verideep: Verifying integrity of deep neural networks through sensitive-sample fingerprinting, 2018.
  • [32] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • [33] Shuo Wang, Surya Nepal, Kristen Moore, Marthie Grobler, Carsten Rudolph, and Alsharif Abuadbba. Octopus: Overcoming performance and privatization bottlenecks in distributed learning. IEEE Transactions on Parallel and Distributed Systems, 33(12):3460–3477, 2022.
  • [34] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [36] Yansong Gao, Huming Qiu, Zhi Zhang, Binghui Wang, Hua Ma, Alsharif Abuadbba, Minhui Xue, Anmin Fu, and Surya Nepal. Deeptheft: Stealing dnn model architectures through power side channel. arXiv preprint arXiv:2309.11894, 2023.
  • [37] Shuo Wang, Surya Nepal, Alsharif Abuadbba, Carsten Rudolph, and Marthie Grobler. Adversarial detection by latent style transformations. IEEE Transactions on Information Forensics and Security, 17:1099–1114, 2022.
  • [38] Nasir Ahmed and Kamisetty Ramamohan Rao. Walsh-Hadamard Transform, pages 99–152. Springer Berlin Heidelberg, Berlin, Heidelberg, 1975.
  • [39] Alsharif Abuadbba and Ibrahim Khalil. Walsh–hadamard-based 3-d steganography for protecting sensitive information in point-of-care. IEEE Transactions on Biomedical Engineering, 64(9):2186–2195, 2016.
  • [40] Walsh-hadamard transform documentation. https://www.mathworks.com/help/signal/ug/walshhadamard-transform.html.
  • [41] Wanli Ouyang and Wai-Kuen Cham. Fast algorithm for walsh hadamard transform on sliding windows. IEEE transactions on pattern analysis and machine intelligence, 32(1):165–171, 2009.
  • [42] Chengbo Li. An efficient algorithm for total variation regularization with applications to the single pixel camera and compressive sensing. PhD thesis, 2010.
  • [43] Trojan attack on neural network. https://github.com/PurduePAML/TrojanNN.
  • [44] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
  • [45] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [46] Ali N Akansu and Richard A Haddad. Multiresolution signal decomposition: transforms, subbands, and wavelets. Academic press, 2001.
  • [47] Fino and Algazi. Unified matrix treatment of the fast walsh-hadamard transform. IEEE Transactions on Computers, 100(11):1142–1146, 1976.
  • [48] Liu Y, Ma S, Aafer Y, Lee W.-C, Zhai J, Wang W, and Zhang X. Trojaning attack on neural networks. in 25nd Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-221, 2018. The Internet Society, 2018. [Online]. Available: https://github.com/ PurduePAML/TrojanNN, 2018.
  • [49] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.
  • [50] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [51] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
  • [52] Xu W, Evans D, and Qi Y. Feature squeezing: Detecting adversarial examples in deep neural networks. in Proceedings of the 2018 Network and Distributed Systems Security Symposium (NDSS), 2018. [Online]. Available: https://github.com/mzweilin/EvadeML-Zoo, 2018.