Too Good to be True? Turn Any Model Differentially Private With DP-Weights
Abstract
Imagine training a machine learning model with Differentially Private Stochastic Gradient Descent (DP-SGD), only to discover post-training that the noise level was either too high, crippling your model’s utility, or too low, compromising privacy. The dreaded realization hits: you must start the lengthy training process from scratch. But what if you could avoid this retraining nightmare? In this study, we introduce a groundbreaking approach (to our knowledge) that applies differential privacy noise to the model’s weights after training. We offer a comprehensive mathematical proof for this novel approach’s privacy bounds, use formal methods to validate its privacy guarantees, and empirically evaluate its effectiveness using membership inference attacks and performance evaluations. This method allows for a single training run, followed by post-hoc noise adjustments to achieve optimal privacy-utility trade-offs. We compare this novel fine-tuned model (DP-Weights model) to a traditional DP-SGD model, demonstrating that our approach yields statistically similar performance and privacy guarantees. Our results validate the efficacy of post-training noise application, promising significant time savings and flexibility in fine-tuning differential privacy parameters, making it a practical alternative for deploying differentially private models in real-world scenarios.
1 Introduction
Differential privacy is a critical component in protecting sensitive information in machine learning models. Traditional methods integrate noise during training, which can significantly impact time to deployment. We propose a new method that applies differentially private noise to the model’s weights after training, aiming to achieve similar privacy guarantees as DP-SGD with similar impacts on model performance. This study investigates the statistical properties of this novel fine-tuned approach (DP-Weights) and compares them with those of a conventional DP-SGD model. Comparisons are also made to a fine-tuned model without noise applied. Our approach can be particularly beneficial in fields requiring frequent model updates or where computational resources are limited, such as healthcare, finance, and personalized recommendations.
2 Related Work
Differentially Private Stochastic Gradient Descent (DP-SGD) has become essential for training models with privacy guarantees. Abadi et al. (2016) pioneered DP-SGD by incorporating noise into gradient updates and applying gradient clip**, providing a tradeoff for model performance while ensuring data privacy [1].
Recent advancements focus on enhancing DP-SGD efficiency and applicability. Yu et al. (2022) introduced parameter-efficient fine-tuning (PEFT) techniques to reduce computational costs and improve performance for large language models [2]. Li et al. (2022) demonstrated DP-SGD’s effectiveness in full fine-tuning for complex models and large datasets [3].
Alternative approaches include DP coordinate descent by Damaskinos et al. (2021), which reduces privacy cost by computing gradients of random coordinates during backpropagation [4]. Koskela et al. (2023) explored privacy amplification through shuffling, enhancing privacy guarantees by shuffling data before DP-SGD [5].
Studies have also explored post-training noise application. Nasr et al. (2018) proposed adding noise to the model’s weights post-training to achieve differential privacy, showing its effectiveness across various models and datasets [9]. Lyu et al. (2020) presented a framework for differentially private model publishing through noise injection into the model’s weights after training, balancing privacy and utility [10].
The DP-UTIL framework analyzes various DP perturbation techniques, highlighting that output perturbation results in lower utility loss [6]. Dong et al. (2022) introduced Gaussian Differential Privacy (GDP), providing a tighter privacy analysis using Gaussian noise [7]. Additionally, research on differentially private training with smooth sensitivity emphasizes optimizing privacy-utility trade-offs [8].
Despite these advancements, there remains a need for a practical approach to applying differential privacy post-training with formal privacy guarantees and empirical validation through robust methods like membership inference attacks. Our work addresses this gap by providing a comprehensive mathematical proof of privacy bounds for post-training noise injection, empirical evaluations using membership inference attacks, and strategies for optimizing privacy-utility trade-offs, promising significant time savings and flexibility in fine-tuning differential privacy parameters.
3 Methodology
3.1 Noise Scale Calculation
The noise scale for applying noise to model weights post training is calculated using the following formula:
(1) |
where is the dataset size, , is the number of epochs, is the learning rate, is the clip** norm, is the privacy parameter, is the batch size. Sufficient knowledge of the training process is vital to ensure proper application of noise.
The additive term in the noise scale was added to gradually ease the noise scale taper as epsilon approaches higher values, ensuring enhanced privacy protection. This refinement aims to maintain a comparative similarity between our new DP-Weights approach and DP-SGD. The specific values for the numerator and the exponent were derived by fitting a model that aligns perplexity scores with epsilon for DP-SGD, and correlates perplexity and noise scale values for our DP-Weights approach. This method ensures a smooth transition and consistent performance across different privacy budgets.
It is critical to note that the most a model’s weights are permitted to be changed during the training process is directly proportional to the learning rate times the clip** norm times the number of epochs a model saw the training data. The clip** norm binds the global sensitivity of the function, and we include the epoch term because a model’s weights may accumulate weight change over multiple training rounds. As such, we define the maximum global sensitivity of the weight update rule to be bound by:
(2) |
The sensitivity is scaled by the inverse of the size of the dataset and batch size. A formal proof of privacy guarantees is offered in the appendix.
3.2 Data and Experimental Setup
We conducted experiments using three primary GPT2 model configurations:
-
1.
DP Model: A traditional model trained with DP-SGD.
-
2.
DP-Weights Model: A fine-tuned model with differentially private noise applied post-training.
-
3.
Fine-Tuned Model: A model fine-tuned without any differential privacy noise.
We used the first 1,000 records from the Open Orca dataset for training and as members in our membership inference attack, and the second 1,000 records from the Open Orca dataset for evaluation as non-members in our membership inference attack.
3.3 Membership Inference Attack Methodology
To assess the vulnerability of each approach to membership inference attacks, we designed and conducted a series of experiments involving each previously noted GPT2 model. The following methodology outlines the steps taken to evaluate these models.
3.3.1 Experimental Setup
We used the GPT-2 language model as the base for all our experiments. The datasets used for training and evaluation were divided into member and non-member sets, representing data that the model was trained on and data it had not seen, respectively. The evaluation metrics focused on model perplexity and confidence scores, with subsequent analysis through membership inference attacks.
3.3.2 Evaluation Metrics
We evaluated the models based on the following metrics:
-
•
Perplexity: Perplexity scores were calculated for both member and non-member datasets.
-
•
Confidence Scores: For each dataset, we computed confidence scores using the softmax output of the model logits.
-
•
Membership Inference Metrics: We performed a membership inference attack by comparing the confidence scores of member and non-member datasets. The metrics used included ROC-AUC, accuracy, precision, recall, and F1-score.
3.3.3 Membership Inference Attack
The membership inference attack was performed as follows:
-
1.
Combine the confidence scores from the member and non-member datasets.
-
2.
Label the scores where 1 indicates a member and 0 indicates a non-member.
-
3.
Set a threshold as the midpoint between the mean confidence scores of the member and non-member datasets.
-
4.
Predict membership based on whether the confidence score is above or below the threshold.
-
5.
Calculate ROC-AUC, accuracy, precision, recall, and F1-score to evaluate the attack’s effectiveness.
3.4 Training Procedure
The training procedure involved multiple steps. Initially, models were trained on a designated member dataset, employing the calculated noise scales for differentially private models. The training was performed using a custom implementation of DP-SGD with gradient clip** and noise addition. For the DP-Weights model, differential privacy noise was applied post-training using the pre-computed noise scales.
3.5 Statistical Simulation For Validating Privacy Guarantees
Overview: This approach leverages statistical simulations to empirically validate differential privacy conditions by evaluating the noise mechanism across a spectrum of epsilon values.
Procedure:
-
1.
Noise Scale Calculation: Define a function to compute the noise scale . The function is based on differential privacy parameters and , number of epochs (E), learning rate (), clip** norm (C), dataset size (N), and batch size (B). The calculation incorporates both the standard Gaussian noise component and an empirical term to ensure adequate privacy protection.
(3) -
2.
Noise Mechanism Simulation: Implement a function to add Gaussian noise to a given weight .
(4) -
3.
Privacy Condition Testing: Develop a function to simulate the noise mechanism for a specified number of samples. This function calculates the violation rate by comparing the logarithm of the ratio of the probability density functions (PDFs) of the noisy weights for adjacent datasets.
Algorithm 1 Privacy Condition Testing 1:, , , , , ,2:3:4:5:6:7:for to do8:9:10: Append to11: if then12:13: end if14:end for15: return -
4.
Experiment Execution: Perform experiments by loo** through a range of epsilon values to compute the violation rates. The results are then visualized using a logarithmic scale plot.
3.6 Formal Verification Approach for Validating Privacy Guarantees Using Z3
Overview: This approach employs formal verification techniques using the Z3 solver to mathematically validate the differential privacy conditions. It explores the impact of multiple compositions on privacy guarantees.
Procedure:
-
1.
Define the Differential Privacy Condition: Utilize a Taylor series expansion to approximate the exponential function for the Gaussian noise PDF. Formulate the differential privacy condition as a Z3 constraint.
(5) (6) (7) -
2.
Advanced Composition: Compute the composed and values after multiple compositions using advanced composition theorems.
(8) (9) Algorithm 2 Differential Privacy Testing with Z3 Solver 1:, , , , , , ,2:3:function TaylorExp(, )4:5:6: for to do7:8:9: end for10: return11:end function12:function DifferentialPrivacy(, , , )13: Define , , , , as Real variables14:15:16: return17:end function18:function AdvancedComposition(, , )19: return ,20:end function21:22:23:Create Z3 solver24:25:26:Create Z3 solver27:
4 Analysis
4.1 Approach: Statistical Simulation
Overview: The statistical simulation approach aimed to empirically validate differential privacy conditions across varying epsilon values. By calculating the noise scale and applying it to a weight, we simulated multiple instances and computed the violation rate based on the ratio of probability density functions (pdfs) of noisy weights.
Results: The simulation revealed the following violation rates:
-
•
Minimum violation rate: 0.000000 at epsilon = 0.01
-
•
Maximum violation rate: 0.000000 at epsilon = 0.01
The violation rates were recorded and visualized, providing insights into the effectiveness of the noise mechanism in preserving privacy under different privacy budgets.
![Refer to caption](extracted/5694311/privacy_loss_distribution.png)
4.2 Approach: Formal Verification Using Z3 Solver
Overview: The formal verification approach employed the Z3 solver to verify differential privacy conditions under advanced composition for multiple queries or iterations. This method provided a rigorous mathematical validation by checking the satisfiability of the differential privacy conditions.
Results: For epsilon = 1, the Z3 solver analysis demonstrated the following:
-
•
The differential privacy condition is satisfied after 1000 compositions.
-
•
Composed epsilon: 191.94103648752323
-
•
Composed delta: 1.001e-05
-
•
The original (non-composed) differential privacy condition is satisfied.
This method confirmed that the privacy guarantees hold under advanced composition, providing validation of the differential privacy mechanisms.
The results from both approaches complement each other, with the statistical simulation providing empirical insights and the formal verification offering rigorous mathematical validation. Together, they demonstrate the effectiveness and robustness of differential privacy mechanisms under varying conditions and compositions.
4.3 Approach: Empirical Evaluation Comparing 5, 10, 50, and 100 Batch Sizes
In this section, we present a comprehensive analysis of the performance of the differentially private (DP-SGD) model compared to the non-differentially private (DP-Weights) noisy model and the fine-tuned model. The analysis includes pairwise comparisons and confidence intervals for various metrics, providing insights into the statistical similarities and differences between the models.
We conducted pairwise comparisons and calculated 95% confidence intervals for the following metrics: perplexity (member and non-member), ROC AUC, accuracy, precision, recall, and F1 score. The results are summarized below:
![Refer to caption](extracted/5694311/bar_plot_combined_batch_sizes.png)
4.3.1 Perplexity (Member)
-
•
DP vs. DP-Weights:
- T-stat:
-
-0.25, P-value: 0.8032
- Confidence Interval:
-
(-1.65, 1.28)
- Interpretation:
-
No significant difference between the DP model and DP-Weights noisy model.
-
•
DP vs. Fine-tuned:
- T-stat:
-
5.95, P-value: 1.05e-08
- Confidence Interval:
-
(2.61, 5.22)
- Interpretation:
-
Significant difference, with the DP model having higher perplexity.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
11.72, P-value: 5.35e-25
- Confidence Interval:
-
(3.41, 4.80)
- Interpretation:
-
Significant difference, with the DP-Weights noisy model having higher perplexity.
4.3.2 Perplexity (Non-member)
-
•
DP vs. DP-Weights:
- T-stat:
-
-2.72, P-value: 0.0069
- Confidence Interval:
-
(-3.30, -0.53)
- Interpretation:
-
Significant difference, with the DP model having lower perplexity.
-
•
DP vs. Fine-tuned:
- T-stat:
-
5.18, P-value: 4.96e-07
- Confidence Interval:
-
(2.00, 4.48)
- Interpretation:
-
Significant difference, with the DP model having higher perplexity.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
16.13, P-value: 3.00e-39
- Confidence Interval:
-
(4.52, 5.79)
- Interpretation:
-
Significant difference, with the DP-Weights noisy model having higher perplexity.
4.3.3 ROC AUC
-
•
DP vs. DP-Weights:
- T-stat:
-
-4.33, P-value: 2.24e-05
- Confidence Interval:
-
(-0.12, -0.05)
- Interpretation:
-
Significant difference, with the DP model having lower ROC AUC.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-5.32, P-value: 2.51e-07
- Confidence Interval:
-
(-0.18, -0.08)
- Interpretation:
-
Significant difference, with the DP model having lower ROC AUC.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-2.02, P-value: 0.0448
- Confidence Interval:
-
(-0.10, -0.001)
- Interpretation:
-
Significant difference, with the DP-Weights noisy model having lower ROC AUC.
4.3.4 Accuracy
-
•
DP vs. DP-Weights:
- T-stat:
-
-1.96, P-value: 0.0518
- Confidence Interval:
-
(-0.07, 0.00)
- Interpretation:
-
No significant difference between the DP model and DP-Weights noisy model.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-5.49, P-value: 1.08e-07
- Confidence Interval:
-
(-0.16, -0.07)
- Interpretation:
-
Significant difference, with the DP model having lower accuracy.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-3.84, P-value: 0.0002
- Confidence Interval:
-
(-0.13, -0.04)
- Interpretation:
-
Significant difference, with the DP-Weights noisy model having lower accuracy.
4.3.5 Precision
-
•
DP vs. DP-Weights:
- T-stat:
-
-2.43, P-value: 0.0161
- Confidence Interval:
-
(-0.08, -0.01)
- Interpretation:
-
Significant difference, with the DP model having lower precision.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-5.42, P-value: 1.57e-07
- Confidence Interval:
-
(-0.15, -0.07)
- Interpretation:
-
Significant difference, with the DP model having lower precision.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-3.13, P-value: 0.0020
- Confidence Interval:
-
(-0.11, -0.02)
- Interpretation:
-
Significant difference, with the DP-Weights noisy model having lower precision.
4.3.6 Recall
-
•
DP vs. DP-Weights:
- T-stat:
-
0.21, P-value: 0.8314
- Confidence Interval:
-
(-0.04, 0.05)
- Interpretation:
-
No significant difference between the DP model and DP-Weights noisy model.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-6.23, P-value: 2.32e-09
- Confidence Interval:
-
(-0.19, -0.10)
- Interpretation:
-
Significant difference, with the DP model having lower recall.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-6.21, P-value: 2.63e-09
- Confidence Interval:
-
(-0.20, -0.10)
- Interpretation:
-
Significant difference, with the DP-Weights noisy model having lower recall.
4.3.7 F1 Score
-
•
DP vs. DP-Weights:
- T-stat:
-
-0.89, P-value: 0.3762
- Confidence Interval:
-
(-0.05, 0.02)
- Interpretation:
-
No significant difference between the DP model and DP-Weights noisy model.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-5.91, P-value: 1.29e-08
- Confidence Interval:
-
(-0.17, -0.08)
- Interpretation:
-
Significant difference, with the DP model having lower F1 scores.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-4.92, P-value: 1.71e-06
- Confidence Interval:
-
(-0.15, -0.07)
- Interpretation:
-
Significant difference, with the DP-Weights noisy model having lower F1 scores.
4.3.8 Summary of Findings
-
•
Perplexity (Member): No significant difference between DP and DP-Weights noisy models, but both significantly differ from the fine-tuned model.
-
•
Perplexity (Non-member): Significant difference between DP and DP-Weights noisy models, with both models also differing significantly from the fine-tuned model.
-
•
ROC AUC: Significant difference between DP and DP-Weights noisy models, and both models differ significantly from the fine-tuned model.
-
•
Accuracy: No significant difference between DP and DP-Weights noisy models, but both significantly differ from the fine-tuned model.
-
•
Precision: Significant difference between DP and DP-Weights noisy models, with both models differing significantly from the fine-tuned model.
-
•
Recall: No significant difference between DP and DP-Weights noisy models, but both significantly differ from the fine-tuned model.
-
•
F1 Score: No significant difference between DP and DP-Weights noisy models, but both significantly differ from the fine-tuned model.
4.4 Approach: Empirical Evaluation Comparing 1, 5, 10, and 20 Epochs
In this section, we analyze the performance differences between the DP (Differentially Private) model, the DP-Weights noisy model, and the fine-tuned model across different epochs of training. The analysis focuses on key metrics: perplexity (for both member and non-member data points), and metrics related to a membership inference attack on the model: ROC AUC, accuracy, precision, recall, and F1 score. The goal is to determine if the DP model behaves similarly to the DP-Weights noisy model and if both differ significantly from the fine-tuned model.
4.4.1 Perplexity (Member)
-
•
DP vs. DP-Weights:
- T-stat:
-
-1.211, P-value: 0.227
- Confidence Interval:
-
(-6.94, 1.66)
- Interpretation:
-
There is no significant difference between the DP model and the DP-Weights noisy model in terms of perplexity for member data points. The confidence interval includes zero, indicating that the differences observed could be due to random chance.
-
•
DP vs. Fine-tuned:
- T-stat:
-
6.573, P-value: 3.47e-10
- Confidence Interval:
-
(6.60, 12.29)
- Interpretation:
-
The DP model shows significantly higher perplexity compared to the fine-tuned model for member data points, indicating poorer performance.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
7.079, P-value: 1.89e-11
- Confidence Interval:
-
(8.71, 15.47)
- Interpretation:
-
Similarly, the DP-Weights noisy model has significantly higher perplexity compared to the fine-tuned model for member data points.
4.4.2 Perplexity (Non-member)
-
•
DP vs. DP-Weights:
- T-stat:
-
-2.185, P-value: 0.030
- Confidence Interval:
-
(-8.69, -0.45)
- Interpretation:
-
There is a significant difference between the DP model and the DP-Weights noisy model for non-member perplexity, suggesting some divergence in performance between the two models.
-
•
DP vs. Fine-tuned:
- T-stat:
-
6.305, P-value: 1.54e-09
- Confidence Interval:
-
(5.99, 11.48)
- Interpretation:
-
The DP model has significantly higher perplexity compared to the fine-tuned model for non-member data points, indicating poorer performance.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
8.227, P-value: 1.62e-14
- Confidence Interval:
-
(10.10, 16.51)
- Interpretation:
-
The DP-Weights noisy model also shows significantly higher perplexity compared to the fine-tuned model for non-member data points.
4.4.3 ROC AUC (Membership Inference Attack)
-
•
DP vs. DP-Weights:
- T-stat:
-
-4.486, P-value: 1.17e-05
- Confidence Interval:
-
(-0.08, -0.03)
- Interpretation:
-
The DP model performs significantly worse than the DP-Weights noisy model in terms of ROC AUC, indicating a notable difference in the membership inference attack performance.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-5.380, P-value: 1.89e-07
- Confidence Interval:
-
(-0.12, -0.06)
- Interpretation:
-
The DP model has a significantly lower ROC AUC compared to the fine-tuned model, indicating poorer classification performance in the membership inference attack.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-1.738, P-value: 0.084
- Confidence Interval:
-
(-0.07, 0.00)
- Interpretation:
-
The DP-Weights noisy model is not significantly different from the fine-tuned model in terms of ROC AUC, with the p-value close to 0.05.
4.4.4 Accuracy (Membership Inference Attack)
-
•
DP vs. DP-Weights:
- T-stat:
-
-1.055, P-value: 0.292
- Confidence Interval:
-
(-0.04, 0.01)
- Interpretation:
-
There is no significant difference in accuracy between the DP model and the DP-Weights noisy model, indicating similar performance in terms of correct predictions in the membership inference attack.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-3.997, P-value: 8.74e-05
- Confidence Interval:
-
(-0.08, -0.03)
- Interpretation:
-
The DP model has significantly lower accuracy compared to the fine-tuned model, indicating poorer performance in the membership inference attack.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-2.699, P-value: 0.007
- Confidence Interval:
-
(-0.07, -0.01)
- Interpretation:
-
The DP-Weights noisy model also shows significantly lower accuracy compared to the fine-tuned model.
4.4.5 Precision (Membership Inference Attack)
-
•
DP vs. DP-Weights:
- T-stat:
-
-1.142, P-value: 0.255
- Confidence Interval:
-
(-0.04, 0.01)
- Interpretation:
-
There is no significant difference in precision between the DP model and the DP-Weights noisy model, indicating similar performance in terms of correctly predicted positive instances in the membership inference attack.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-3.996, P-value: 8.77e-05
- Confidence Interval:
-
(-0.08, -0.03)
- Interpretation:
-
The DP model has significantly lower precision compared to the fine-tuned model.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-2.489, P-value: 0.014
- Confidence Interval:
-
(-0.07, -0.01)
- Interpretation:
-
The DP-Weights noisy model also has significantly lower precision compared to the fine-tuned model.
4.4.6 Recall (Membership Inference Attack)
-
•
DP vs. DP-Weights:
- T-stat:
-
1.150, P-value: 0.252
- Confidence Interval:
-
(-0.01, 0.05)
- Interpretation:
-
There is no significant difference in recall between the DP model and the DP-Weights noisy model, indicating similar performance in terms of identifying all relevant instances in the membership inference attack.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-4.239, P-value: 3.29e-05
- Confidence Interval:
-
(-0.09, -0.03)
- Interpretation:
-
The DP model has significantly lower recall compared to the fine-tuned model.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-4.685, P-value: 4.87e-06
- Confidence Interval:
-
(-0.11, -0.05)
- Interpretation:
-
The DP-Weights noisy model also has significantly lower recall compared to the fine-tuned model.
4.4.7 F1 Score (Membership Inference Attack)
-
•
DP vs. DP-Weights:
- T-stat:
-
0.234, P-value: 0.815
- Confidence Interval:
-
(-0.02, 0.03)
- Interpretation:
-
There is no significant difference in F1 score between the DP model and the DP-Weights noisy model, indicating similar performance in terms of the balance between precision and recall in the membership inference attack.
-
•
DP vs. Fine-tuned:
- T-stat:
-
-4.225, P-value: 3.49e-05
- Confidence Interval:
-
(-0.09, -0.03)
- Interpretation:
-
The DP model has a significantly lower F1 score compared to the fine-tuned model.
-
•
DP-Weights vs. Fine-tuned:
- T-stat:
-
-3.840, P-value: 0.000160
- Confidence Interval:
-
(-0.09, -0.03)
- Interpretation:
-
The DP-Weights noisy model also has a significantly lower F1 score compared to the fine-tuned model.
4.4.8 Summary of Epoch-based Analysis
The results from the epoch-based analysis support the following conclusions:
-
•
Similarity between DP and DP-Weights Models: The DP model and DP-Weights noisy model show similar performance in most metrics, with no significant differences in perplexity (member), accuracy, precision, recall, and F1 score. The only notable differences are in perplexity (non-member) and ROC AUC, where the DP model performs slightly worse.
-
•
Difference from Fine-tuned Model: Both the DP model and the DP-Weights noisy model are significantly different from the fine-tuned model across all metrics. This indicates that the fine-tuned model consistently outperforms the DP and DP-Weights models in terms of perplexity, ROC AUC, accuracy, precision, recall, and F1 score – indicating the fine-tuned model is more susceptible to membership inference attacks.
These findings suggest that the DP-SGD model can achieve performance (relatively) comparable to the DP-Weights noisy model, with statistically similar behavior in most metrics, while both models are distinct from the fine-tuned model in terms of performance.
![Refer to caption](extracted/5694311/dp_experiment_radar_chart_averaged.png)
5 Discussion
The results of this study provide significant insights into the efficacy of applying differentially private (DP) noise to machine learning models post-training. Our approach was compared against a traditional DP model and a fine-tuned model without DP noise, across various metrics and experimental setups involving different batch sizes and epochs.
5.1 Performance Comparison
Our experiments demonstrated that the novel fine-tuned model (DP-Weights model) with post-training noise application achieves performance metrics statistically similar to those of the traditional DP model. This finding is crucial as it validates the effectiveness of our method in maintaining privacy guarantees while maintaining a similar level of performance degradation compared to the DP-SGD model. This can be particularly advantageous in practical applications where retraining a model is costly or impractical.
5.1.1 Perplexity
The perplexity analysis for both member data points revealed that the DP-Weights model and the DP-SGD model exhibited comparable perplexity scores. For member data points, there was no significant difference between the DP-SGD and DP-Weights models, while both had significantly higher perplexity compared to the fine-tuned model. This suggests that the addition of noise, whether during or post-training, introduces similar levels of uncertainty in the model’s predictions.
For non-member data points, the DP model showed significantly lower perplexity compared to the DP-Weights model, indicating slightly better generalization. However, both models had higher perplexity than the fine-tuned model, highlighting the impact of differential privacy on model performance.
5.1.2 Classification Metrics
In terms of ROC AUC, accuracy, precision, recall, and F1 score, the DP model and the DP-Weights model exhibited similar behavior across different batch sizes and epochs. The significant differences observed were mainly between these models and the fine-tuned model. This indicates that while differential privacy noise impacts model performance, the timing of its application (during or post-training) should not result in substantial performance differences if applied correctly.
5.1.3 DP-Weights Losses and Gains vs. DP-SGD
Looking at the radar plots, there seems to be a tradeoff between DP-Weights and DP-SGD that is not present in the descriptive statistics. This discrepancy is important to note. Visually, it is clear these two approaches are not identical.
-
•
On average across epsilon values from 1 to 1,000, at lower batch sizes like 1 through 5, DP-SGD appears to provide noticeably greater privacy protection and perplexity performance, whereas DP-Weights appears to provide slightly greater privacy protection and perplexity performance at higher batch sizes when holding epochs at 10, dataset size at 1000, learning rate at 5e-5, clip** norm at 1.0.
-
•
On average across epsilon values from 1 to 1,000, DP-SGD and DP-Weights appear to offer similar privacy protections for epochs of values 1, 5, and 10. However, DP-SGD offers greater privacy protection and performance at high numbers of epochs like 20. Both offer statistically greater privacy protection than the fine-tuned model.
This indicates that the batch size term is not correctly accounted for in practice when using the proposed noise scale.
We caution users of DP-Weights to be mindful of these tradeoffs should they choose to incorporate our approach into their privacy protection protocols. While our mathematical proof may provide evidence to support the fact that our approach is Differentially Private, and the statistics do not show any significant difference between DP-SGD and DP-Weights, this does not necessarily mean DP-Weights will offer the same level of privacy protections as other DP approaches to training machine learning models.
5.2 Implications of Post-Training Noise Application
The primary advantage of applying DP noise post-training is the potential for improved training efficiency and model performance tuning. Traditional methods integrating noise during training can lead to significant training costs and require careful tuning of the noise parameters to balance privacy and accuracy. By applying noise post-training, our method simplifies the training process and allows for better optimization of the model’s performance before introducing privacy guarantees.
5.3 Limitations and Future Work
While our study shows promising results, it is essential to consider the limitations and potential areas for future research. One limitation is the assumption that the noise added post-training uniformly affects all model weights. Further investigation is needed to understand the impact of noise distribution and its implications on model performance.
This approach requires a machine learning practitioner to have knowledge of the dataset size, learning rate, batch size, and the gradients must be clipped during training for this approach to be mathematically differentially private. Without clip** the gradients, the global sensitivity of the model weight update rule is not bound by the gradient norm. As such, one must always have knowledge of how a model was trained in order for this approach to provide the guarantees afforded by the Differential Privacy framework.
Additionally, exploring the scalability of our approach to larger models and more complex datasets is a critical next step. Understanding how post-training noise application interacts with various model architectures and training regimes will provide deeper insights into its practical applicability.
6 Conclusion
This study introduces a novel noise scale for applying differential privacy noise to machine learning models post-training. Our analysis across various metrics and experimental setups shows that this approach yields performance statistically similar to that of traditional DP models while simplifying the training process. These findings highlight the potential of post-training noise application as a viable alternative for achieving privacy guarantees in machine learning models.
Future work will focus on refining this method, exploring its applicability to different model architectures, and further understanding the nuances of noise distribution and its impact on model performance.
7 Acknowledgements
I, David Zagardo, would like to thank my dog, Mr. Macaroni, for his support during this process. He has proven an invaluable contributor to my research. ArXiv does not allow for animals as co-authors, but rest assured that the inclusive language used in the aforementioned text refers to both myself and Mr. Macaroni, without whom this publication would not have happened.
References
- [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, ”Deep Learning with Differential Privacy,” Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
- [2] H. Yu, K. Su, Y. Chen, and T. Goldstein, ”Differentially Private Model Training with Fast Adaptation,” arXiv preprint arXiv:2206.02617, 2022.
- [3] Y. Li, P. Kairouz, and M. Jaggi, ”Large Scale Differentially Private Fine-tuning,” arXiv preprint arXiv:2112.00845, 2022.
- [4] G. Damaskinos, R. Tramèr, and F. B. Shepherd, ”Differentially Private Coordinate Descent,” arXiv preprint arXiv:2112.00845, 2021.
- [5] J. Koskela, A. Bun, and V. Pihur, ”Privacy Amplification by Shuffling: Applications to DP-SGD,” arXiv preprint arXiv:2206.02617, 2023.
- [6] J. Yu, K. Zhao, and A. Srivastava, ”DP-UTIL: Comprehensive Utility Analysis of Differential Privacy in Machine Learning,” arXiv preprint arXiv:2102.07746, 2021.
- [7] H. Dong, J. Hsu, and A. Roth, ”Gaussian Differential Privacy,” Journal of Privacy and Confidentiality, 2022.
- [8] S. Vadhan, ”Differentially Private Training of Deep Neural Networks with Smooth Sensitivity,” arXiv preprint arXiv:2210.05659, 2022.
- [9] M. Nasr, R. Shokri, and A. Houmansadr, ”Comprehensive Privacy Analysis of Deep Learning,” arXiv preprint arXiv:1812.00910, 2018.
- [10] J. Lyu, D. Su, B. Xu, and Y. Wang, ”Differentially Private Model Publishing through Noise Injection,” arXiv preprint arXiv:2006.10218, 2020.
8 Appendix
8.1 Mathematical Rigor and Proof of Differential Privacy
To establish the differential privacy guarantees of our proposed method of applying noise post-training, we follow a systematic approach to prove that the mechanism is indeed differentially private.
8.1.1 Differential Privacy Definition
A randomized mechanism provides -differential privacy if for any two adjacent datasets and (differing by at most one element) and for any set of outcomes :
8.1.2 Step-by-Step Proof
Step 1: Sensitivity Calculation of the Weight Update Rule
Let be the set of trainable weights in the model. The sensitivity is the maximum change in any trainable weight when one data point is added or removed from the training dataset:
where is the number of epochs, is the learning rate, is the clip** norm, is the dataset size, and is the batch size.
Step 2: Noise Scale Calculation
The noise added to each trainable weight is drawn from a Gaussian distribution with zero mean and variance , where:
This ensures the added noise accounts for the empirical term, enhancing privacy protection.
Step 3: Probability Distribution
The probability density function (pdf) of the Gaussian noise is:
Step 4: Adjacent Datasets
Consider two adjacent datasets and that differ by one data point. The corresponding weights after training on these datasets are and , respectively, with for each trainable weight .
Step 5: Privacy Guarantee
The goal is to show:
where is the added Gaussian noise.
The pdfs of the noisy weights are:
The ratio of these probabilities for adjacent datasets is:
Since , we have:
Substituting :
The bound becomes:
To simplify, denote and , giving:
To ensure that:
we need to show:
Substituting and :
Simplifying the bound:
Assuming is relatively small compared to :
Thus:
Ensuring:
Solving for :
Therefore, our mechanism satisfies -differential privacy if is chosen such that:
The inclusion of the empirical term provides an additional conservative noise buffer, ensuring robustness of the privacy guarantee. By combining and , the overall noise scale maintains the differential privacy requirements.
Thus, the mechanism with post-training noise addition conclusively satisfies -differential privacy, providing both theoretical and empirical noise components for enhanced privacy protection.
However, if is not relatively small compared to , we need to reassess the bound more rigorously.
Considering is not relatively small compared to :
The noise scale is:
The bound on the ratio of probabilities for adjacent datasets becomes:
Ensuring:
Thus:
Substituting and :
Expanding the square:
To ensure the bound:
Simplify further:
Rewriting the inequality:
Dividing both sides by 2:
Solving for involves ensuring that the terms on the right-hand side appropriately bound . This guarantees -differential privacy if:
8.1.3 Rényi Differential Privacy and Subsampling Amplification
Rényi Differential Privacy Definition
Define -Rényi divergence between two probability distributions and :
A mechanism is -RDP if for all adjacent datasets and :
Our -sensitivity remains:
For a Gaussian mechanism with noise , the RDP guarantee is:
With a sampling rate (batch size / dataset size), we can use the following bound (Wang et al., 2019):
We need to calibrate to achieve the desired RDP guarantee. Let’s set:
For epochs, we use RDP composition:
Use the conversion theorem (Canonne et al., 2020) to convert RDP to -DP: For any , the mechanism satisfies -DP where:
To account for the empirical term in our noise formula, we add it to our :
Final Privacy Guarantee
The mechanism satisfies -DP where:
By leveraging RDP and subsampling amplification, our mechanism achieves tighter privacy bounds, enhancing the overall privacy guarantee.
Thus, the mechanism with post-training noise addition conclusively satisfies -differential privacy, providing both theoretical and empirical noise components for enhanced privacy protection. This ensures robustness even when is not relatively small compared to .
Step 6: Explicitly Accounting for
The above derivation shows the bound in terms of . To fully account for , we need to ensure that the probability mass beyond the bound is captured by . By designing the noise scale to include the term , we ensure that the probability mass beyond is at most .
8.1.4 Considerations and Assumptions
-
•
Composition: If the method is applied multiple times or combined with other privacy-preserving techniques, the privacy guarantees compose according to the composition theorems of differential privacy.
-
•
Assumptions: The analysis assumes the independence of noise added to different weights.
8.1.5 Proof Summary
By adding Gaussian noise to the model weights post-training with the calculated noise scale, the proposed method ensures differential privacy. This proof demonstrates that the noise addition mechanism adheres to the definition of -differential privacy, providing a rigorous foundation for the approach.
8.1.6 Further Validation
Future work would consider the following:
-
•
Empirical Validation: Conduct experiments with more datasets and more models.
-
•
Simulations: Use synthetic datasets for analytical comparisons.
-
•
Peer Review: Submit for feedback from privacy and machine learning experts.
The comprehensive proof and additional validation steps ensure a robust and reliable method for achieving differential privacy through post-training noise addition.
8.2 Tables and Figures
Batch Size | Model | Perplex. (Mem.) | Perplex. (Non-Mem.) | ROC AUC | Accuracy | Prec. | Recall | F1 Score |
Mean Std | Mean Std | Mean Std | Mean Std | Mean Std | Mean Std | Mean Std | ||
5 | dp_model | |||||||
5 | fine_tuned_model | |||||||
5 | non_dp_model | |||||||
10 | dp_model | |||||||
10 | fine_tuned_model | |||||||
10 | non_dp_model | |||||||
50 | dp_model | |||||||
50 | fine_tuned_model | |||||||
50 | non_dp_model | |||||||
100 | dp_model | |||||||
100 | fine_tuned_model | |||||||
100 | non_dp_model |
Epoch | Model | Perplex. (Mem.) | Perplex. (Non-Mem.) | ROC AUC | Accuracy | Prec. | Recall | F1 Score |
Mean Std | Mean Std | Mean Std | Mean Std | Mean Std | Mean Std | Mean Std | ||
1 | dp_model | |||||||
1 | fine_tuned_model | |||||||
1 | non_dp_model | |||||||
5 | dp_model | |||||||
5 | fine_tuned_model | |||||||
5 | non_dp_model | |||||||
10 | dp_model | |||||||
10 | fine_tuned_model | |||||||
10 | non_dp_model | |||||||
20 | dp_model | |||||||
20 | fine_tuned_model | |||||||
20 | non_dp_model |
![Refer to caption](extracted/5694311/dp_experiment_radar_chart_averaged_1_10.png)
![Refer to caption](extracted/5694311/dp_experiment_radar_chart_averaged_10_100.png)
![Refer to caption](extracted/5694311/dp_experiment_radar_chart_averaged_100_1000.png)
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 12.505 | 26.940 | 36.620 |
RMSE for perplexity_non_member | 12.577 | 26.305 | 36.298 |
RMSE for roc_auc | 0.072 | 0.058 | 0.078 |
RMSE for accuracy | 0.082 | 0.048 | 0.077 |
RMSE for precision | 0.087 | 0.051 | 0.088 |
RMSE for recall | 0.125 | 0.073 | 0.098 |
RMSE for f1 | 0.102 | 0.050 | 0.092 |
Pearson correlation for perplexity_member | 0.902 | nan | nan |
Pearson correlation for perplexity_non_member | 0.905 | nan | nan |
Pearson correlation for roc_auc | 0.091 | nan | nan |
Pearson correlation for accuracy | -0.061 | nan | nan |
Pearson correlation for precision | 0.006 | nan | nan |
Pearson correlation for recall | -0.013 | nan | nan |
Pearson correlation for f1 | -0.037 | nan | nan |
MAE for perplexity_member | 9.420 | 21.949 | 30.785 |
MAE for perplexity_non_member | 9.581 | 21.468 | 30.590 |
MAE for roc_auc | 0.052 | 0.046 | 0.055 |
MAE for accuracy | 0.066 | 0.032 | 0.063 |
MAE for precision | 0.071 | 0.036 | 0.069 |
MAE for recall | 0.086 | 0.039 | 0.068 |
MAE for f1 | 0.079 | 0.035 | 0.069 |
R2 score for perplexity_member | 0.359 | -1.974 | -2.409 |
R2 score for perplexity_non_member | 0.315 | -1.995 | -2.451 |
R2 score for roc_auc | -2.455 | -1.231 | -0.485 |
R2 score for accuracy | -3.449 | -0.542 | -0.238 |
R2 score for precision | -4.036 | -0.750 | -0.269 |
R2 score for recall | -1.998 | -0.022 | -0.069 |
R2 score for f1 | -3.393 | -0.059 | -0.153 |
MSE for perplexity_member | 156.363 | 725.739 | 1341.057 |
MSE for perplexity_non_member | 158.187 | 691.943 | 1317.525 |
MSE for roc_auc | 0.005 | 0.003 | 0.006 |
MSE for accuracy | 0.007 | 0.002 | 0.006 |
MSE for precision | 0.008 | 0.003 | 0.008 |
MSE for recall | 0.016 | 0.005 | 0.010 |
MSE for f1 | 0.010 | 0.003 | 0.008 |
MedAE for perplexity_member | 7.455 | 17.503 | 23.244 |
MedAE for perplexity_non_member | 8.031 | 17.199 | 23.265 |
MedAE for roc_auc | 0.040 | 0.055 | 0.030 |
MedAE for accuracy | 0.050 | 0.025 | 0.050 |
MedAE for precision | 0.056 | 0.022 | 0.056 |
MedAE for recall | 0.100 | 0.000 | 0.100 |
MedAE for f1 | 0.074 | 0.026 | 0.074 |
Coefficient of Variation for perplexity_member | 0.488 | 0.487 | 0.0 |
Coefficient of Variation for perplexity_non_member | 0.481 | 0.481 | 1.685e-16 |
Coefficient of Variation for roc_auc | 0.082 | 0.137 | 2.174e-16 |
Coefficient of Variation for accuracy | 0.076 | 0.137 | 2.056e-16 |
Coefficient of Variation for precision | 0.076 | 0.154 | 0.0 |
Coefficient of Variation for recall | 0.144 | 0.204 | 0.0 |
Coefficient of Variation for f1 | 0.096 | 0.176 | 2.148e-16 |
MAPE for perplexity_member | 0.331 | 0.576 | 0.686 |
MAPE for perplexity_non_member | 0.338 | 0.572 | 0.684 |
MAPE for roc_auc | 0.110 | 0.104 | 0.136 |
MAPE for accuracy | 0.128 | 0.067 | 0.133 |
MAPE for precision | 0.138 | 0.074 | 0.152 |
MAPE for recall | 0.164 | 0.074 | 0.176 |
MAPE for f1 | 0.152 | 0.071 | 0.164 |
Wilcoxon test for perplexity_member | 12.0, | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 9.0, | 0.0, | 0.0, |
Wilcoxon test for roc_auc | 162.0, | 20.5, | 55.0, |
Wilcoxon test for accuracy | 114.5, | 1.0, | 34.0, |
Wilcoxon test for precision | 115.5, | 5.0, | 42.0, |
Wilcoxon test for recall | 36.5, | 16.0, | 36.0, |
Wilcoxon test for f1 | 106.0, | 60.0, | 73.0, |
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 7.123 | 13.296 | 7.165 |
RMSE for perplexity_non_member | 6.595 | 13.011 | 7.575 |
RMSE for roc_auc | 0.079 | 0.033 | 0.076 |
RMSE for accuracy | 0.109 | 0.070 | 0.074 |
RMSE for precision | 0.122 | 0.072 | 0.088 |
RMSE for recall | 0.167 | 0.073 | 0.135 |
RMSE for f1 | 0.145 | 0.070 | 0.112 |
Pearson correlation for perplexity_member | 0.922 | nan | nan |
Pearson correlation for perplexity_non_member | 0.922 | nan | nan |
Pearson correlation for roc_auc | 0.018 | nan | nan |
Pearson correlation for accuracy | -0.100 | nan | nan |
Pearson correlation for precision | -0.082 | nan | nan |
Pearson correlation for recall | -0.229 | nan | nan |
Pearson correlation for f1 | -0.168 | nan | nan |
MAE for perplexity_member | 4.356 | 9.001 | 6.126 |
MAE for perplexity_non_member | 4.126 | 8.774 | 6.479 |
MAE for roc_auc | 0.061 | 0.030 | 0.060 |
MAE for accuracy | 0.093 | 0.063 | 0.059 |
MAE for precision | 0.103 | 0.065 | 0.069 |
MAE for recall | 0.136 | 0.054 | 0.111 |
MAE for f1 | 0.119 | 0.059 | 0.091 |
R2 score for perplexity_member | 0.470 | -0.846 | -2.717 |
R2 score for perplexity_non_member | 0.529 | -0.834 | -2.726 |
R2 score for roc_auc | -4.865 | -0.043 | -0.450 |
R2 score for accuracy | -1.853 | -0.171 | -0.030 |
R2 score for precision | -2.402 | -0.181 | -0.052 |
R2 score for recall | -4.212 | -0.002 | -0.447 |
R2 score for f1 | -3.573 | -0.048 | -0.268 |
MSE for perplexity_member | 50.742 | 176.777 | 51.335 |
MSE for perplexity_non_member | 43.490 | 169.299 | 57.378 |
MSE for roc_auc | 0.006 | 0.001 | 0.006 |
MSE for accuracy | 0.012 | 0.005 | 0.005 |
MSE for precision | 0.015 | 0.005 | 0.008 |
MSE for recall | 0.028 | 0.005 | 0.018 |
MSE for f1 | 0.021 | 0.005 | 0.012 |
MedAE for perplexity_member | 2.159 | 4.802 | 4.431 |
MedAE for perplexity_non_member | 2.356 | 4.613 | 4.718 |
MedAE for roc_auc | 0.045 | 0.030 | 0.065 |
MedAE for accuracy | 0.100 | 0.050 | 0.050 |
MedAE for precision | 0.100 | 0.056 | 0.056 |
MedAE for recall | 0.100 | 0.100 | 0.100 |
MedAE for f1 | 0.103 | 0.067 | 0.079 |
Coefficient of Variation for perplexity_member | 0.799 | 0.394 | 1.303e-16 |
Coefficient of Variation for perplexity_non_member | 0.783 | 0.392 | 1.215e-16 |
Coefficient of Variation for roc_auc | 0.068 | 0.123 | 1.178e-16 |
Coefficient of Variation for accuracy | 0.125 | 0.152 | 0.0 |
Coefficient of Variation for precision | 0.127 | 0.181 | 0.0 |
Coefficient of Variation for recall | 0.148 | 0.269 | 0.0 |
Coefficient of Variation for f1 | 0.134 | 0.225 | 0.0 |
MAPE for perplexity_member | 0.327 | 0.527 | 0.587 |
MAPE for perplexity_non_member | 0.329 | 0.506 | 0.583 |
MAPE for roc_auc | 0.131 | 0.061 | 0.111 |
MAPE for accuracy | 0.176 | 0.119 | 0.125 |
MAPE for precision | 0.194 | 0.123 | 0.157 |
MAPE for recall | 0.261 | 0.110 | 0.339 |
MAPE for f1 | 0.225 | 0.116 | 0.244 |
Wilcoxon test for perplexity_member | 149.0, | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 159.0, | 0.0, | 0.0, |
Wilcoxon test for roc_auc | 94.5, | 126.0, | 62.5, |
Wilcoxon test for accuracy | 91.5, | 62.0, | 103.5, |
Wilcoxon test for precision | 90.5, | 80.0, | 89.0, |
Wilcoxon test for recall | 49.0, | 56.0, | 45.0, |
Wilcoxon test for f1 | 79.5, | 108.0, | 82.0, |
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 7.123 | 13.296 | 7.165 |
RMSE for perplexity_non_member | 6.595 | 13.011 | 7.575 |
RMSE for roc_auc | 0.079 | 0.033 | 0.076 |
RMSE for accuracy | 0.109 | 0.070 | 0.074 |
RMSE for precision | 0.122 | 0.072 | 0.088 |
RMSE for recall | 0.167 | 0.073 | 0.135 |
RMSE for f1 | 0.145 | 0.070 | 0.112 |
Pearson correlation for perplexity_member | 0.922 | nan | nan |
Pearson correlation for perplexity_non_member | 0.922 | nan | nan |
Pearson correlation for roc_auc | 0.018 | nan | nan |
Pearson correlation for accuracy | -0.100 | nan | nan |
Pearson correlation for precision | -0.082 | nan | nan |
Pearson correlation for recall | -0.229 | nan | nan |
Pearson correlation for f1 | -0.168 | nan | nan |
MAE for perplexity_member | 4.356 | 9.001 | 6.126 |
MAE for perplexity_non_member | 4.126 | 8.774 | 6.479 |
MAE for roc_auc | 0.061 | 0.030 | 0.060 |
MAE for accuracy | 0.093 | 0.063 | 0.059 |
MAE for precision | 0.103 | 0.065 | 0.069 |
MAE for recall | 0.136 | 0.054 | 0.111 |
MAE for f1 | 0.119 | 0.059 | 0.091 |
R2 score for perplexity_member | 0.470 | -0.846 | -2.717 |
R2 score for perplexity_non_member | 0.529 | -0.834 | -2.726 |
R2 score for roc_auc | -4.865 | -0.043 | -0.450 |
R2 score for accuracy | -1.853 | -0.171 | -0.030 |
R2 score for precision | -2.402 | -0.181 | -0.052 |
R2 score for recall | -4.212 | -0.002 | -0.447 |
R2 score for f1 | -3.573 | -0.048 | -0.268 |
MSE for perplexity_member | 50.742 | 176.777 | 51.335 |
MSE for perplexity_non_member | 43.490 | 169.299 | 57.378 |
MSE for roc_auc | 0.006 | 0.001 | 0.006 |
MSE for accuracy | 0.012 | 0.005 | 0.005 |
MSE for precision | 0.015 | 0.005 | 0.008 |
MSE for recall | 0.028 | 0.005 | 0.018 |
MSE for f1 | 0.021 | 0.005 | 0.012 |
MedAE for perplexity_member | 2.159 | 4.802 | 4.431 |
MedAE for perplexity_non_member | 2.356 | 4.613 | 4.718 |
MedAE for roc_auc | 0.045 | 0.030 | 0.065 |
MedAE for accuracy | 0.100 | 0.050 | 0.050 |
MedAE for precision | 0.100 | 0.056 | 0.056 |
MedAE for recall | 0.100 | 0.100 | 0.100 |
MedAE for f1 | 0.103 | 0.067 | 0.079 |
Coefficient of Variation for perplexity_member | 0.799 | 0.394 | 1.303e-16 |
Coefficient of Variation for perplexity_non_member | 0.783 | 0.392 | 1.215e-16 |
Coefficient of Variation for roc_auc | 0.068 | 0.123 | 1.178e-16 |
Coefficient of Variation for accuracy | 0.125 | 0.152 | 0.0 |
Coefficient of Variation for precision | 0.127 | 0.181 | 0.0 |
Coefficient of Variation for recall | 0.148 | 0.269 | 0.0 |
Coefficient of Variation for f1 | 0.134 | 0.225 | 0.0 |
MAPE for perplexity_member | 0.327 | 0.527 | 0.587 |
MAPE for perplexity_non_member | 0.329 | 0.506 | 0.583 |
MAPE for roc_auc | 0.131 | 0.061 | 0.111 |
MAPE for accuracy | 0.176 | 0.119 | 0.125 |
MAPE for precision | 0.194 | 0.123 | 0.157 |
MAPE for recall | 0.261 | 0.110 | 0.339 |
MAPE for f1 | 0.225 | 0.116 | 0.244 |
Wilcoxon test for perplexity_member | 149.0, 0.227 | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 159.0, 0.327 | 0.0, | 0.0, |
Wilcoxon test for roc_auc | 94.5, 0.023 | 126.0, 0.206 | 62.5, 0.002 |
Wilcoxon test for accuracy | 91.5, 0.032 | 62.0, 0.006 | 103.5, 0.672 |
Wilcoxon test for precision | 90.5, 0.030 | 80.0, 0.023 | 89.0, 0.352 |
Wilcoxon test for recall | 49.0, 0.011 | 56.0, 0.796 | 45.0, 0.003 |
Wilcoxon test for f1 | 79.5, 0.015 | 108.0, 0.084 | 82.0, 0.051 |
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 3.197 | 7.225 | 6.567 |
RMSE for perplexity_non_member | 3.577 | 6.896 | 7.528 |
RMSE for roc_auc | 0.081 | 0.025 | 0.073 |
RMSE for accuracy | 0.081 | 0.053 | 0.070 |
RMSE for precision | 0.088 | 0.053 | 0.073 |
RMSE for recall | 0.115 | 0.080 | 0.145 |
RMSE for f1 | 0.090 | 0.061 | 0.099 |
Pearson correlation for perplexity_member | 0.896 | nan | nan |
Pearson correlation for perplexity_non_member | 0.887 | nan | nan |
Pearson correlation for roc_auc | 0.003 | nan | nan |
Pearson correlation for accuracy | 0.154 | nan | nan |
Pearson correlation for precision | 0.073 | nan | nan |
Pearson correlation for recall | 0.273 | nan | nan |
Pearson correlation for f1 | 0.283 | nan | nan |
MAE for perplexity_member | 2.573 | 4.447 | 5.560 |
MAE for perplexity_non_member | 3.169 | 4.076 | 6.378 |
MAE for roc_auc | 0.070 | 0.016 | 0.062 |
MAE for accuracy | 0.059 | 0.039 | 0.055 |
MAE for precision | 0.067 | 0.041 | 0.061 |
MAE for recall | 0.082 | 0.050 | 0.118 |
MAE for f1 | 0.066 | 0.044 | 0.081 |
R2 score for perplexity_member | 0.685 | -0.610 | -2.532 |
R2 score for perplexity_non_member | 0.587 | -0.537 | -2.543 |
R2 score for roc_auc | -10.510 | -0.098 | -0.456 |
R2 score for accuracy | -1.291 | -0.004 | -0.123 |
R2 score for precision | -1.680 | -0.0002 | -0.039 |
R2 score for recall | -2.364 | -0.636 | -1.391 |
R2 score for f1 | -1.601 | -0.184 | -0.769 |
MSE for perplexity_member | 10.222 | 52.200 | 43.127 |
MSE for perplexity_non_member | 12.792 | 47.557 | 56.675 |
MSE for roc_auc | 0.007 | 0.001 | 0.005 |
MSE for accuracy | 0.007 | 0.003 | 0.005 |
MSE for precision | 0.008 | 0.003 | 0.005 |
MSE for recall | 0.013 | 0.006 | 0.021 |
MSE for f1 | 0.008 | 0.004 | 0.010 |
MedAE for perplexity_member | 2.284 | 1.442 | 3.925 |
MedAE for perplexity_non_member | 2.885 | 1.023 | 4.564 |
MedAE for roc_auc | 0.065 | 0.010 | 0.050 |
MedAE for accuracy | 0.050 | 0.050 | 0.050 |
MedAE for precision | 0.055 | 0.045 | 0.050 |
MedAE for recall | 0.100 | 0.000 | 0.100 |
MedAE for f1 | 0.048 | 0.029 | 0.071 |
Coefficient of Variation for perplexity_member | 0.781 | 0.417 | 0.0 |
Coefficient of Variation for perplexity_non_member | 0.746 | 0.412 | 0.0 |
Coefficient of Variation for roc_auc | 0.050 | 0.115 | 0.0 |
Coefficient of Variation for accuracy | 0.099 | 0.128 | 2.056e-16 |
Coefficient of Variation for precision | 0.100 | 0.137 | 2.073e-16 |
Coefficient of Variation for recall | 0.116 | 0.195 | 0.0 |
Coefficient of Variation for f1 | 0.104 | 0.151 | 0.0 |
MAPE for perplexity_member | 0.466 | 0.399 | 0.596 |
MAPE for perplexity_non_member | 0.570 | 0.337 | 0.589 |
MAPE for roc_auc | 0.145 | 0.034 | 0.112 |
MAPE for accuracy | 0.112 | 0.077 | 0.114 |
MAPE for precision | 0.127 | 0.080 | 0.123 |
MAPE for recall | 0.147 | 0.107 | 0.287 |
MAPE for f1 | 0.122 | 0.090 | 0.181 |
Wilcoxon test for perplexity_member | 77.0, | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 49.0, | 0.0, | 0.0, |
Wilcoxon test for roc_auc | 61.0, | 81.0, | 52.5, |
Wilcoxon test for accuracy | 51.0, | 55.0, | 42.0, |
Wilcoxon test for precision | 103.5, | 98.0, | 138.0, |
Wilcoxon test for recall | 19.5, | 0.0, | 7.0, |
Wilcoxon test for f1 | 57.5, | 55.0, | 34.0, |
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 1.105 | 1.375 | 2.320 |
RMSE for perplexity_non_member | 4.430 | 0.483 | 4.327 |
RMSE for roc_auc | 0.222 | 0.348 | 0.151 |
RMSE for accuracy | 0.202 | 0.334 | 0.181 |
RMSE for precision | 0.202 | 0.304 | 0.135 |
RMSE for recall | 0.218 | 0.380 | 0.247 |
RMSE for f1 | 0.206 | 0.341 | 0.189 |
Pearson correlation for perplexity_member | 0.948 | nan | nan |
Pearson correlation for perplexity_non_member | 0.362 | nan | nan |
Pearson correlation for roc_auc | 0.706 | nan | nan |
Pearson correlation for accuracy | 0.386 | nan | nan |
Pearson correlation for precision | 0.427 | nan | nan |
Pearson correlation for recall | 0.330 | nan | nan |
Pearson correlation for f1 | 0.382 | nan | nan |
MAE for perplexity_member | 1.045 | 1.001 | 2.046 |
MAE for perplexity_non_member | 3.994 | 0.334 | 3.819 |
MAE for roc_auc | 0.194 | 0.282 | 0.144 |
MAE for accuracy | 0.182 | 0.277 | 0.173 |
MAE for precision | 0.181 | 0.248 | 0.129 |
MAE for recall | 0.182 | 0.314 | 0.232 |
MAE for f1 | 0.181 | 0.280 | 0.180 |
R2 score for perplexity_member | -0.375 | -1.129 | -3.498 |
R2 score for perplexity_non_member | -95.737 | -0.151 | -3.526 |
R2 score for roc_auc | -0.196 | -1.928 | -9.677 |
R2 score for accuracy | -0.171 | -2.195 | -11.602 |
R2 score for precision | -0.313 | -1.974 | -10.019 |
R2 score for recall | -0.044 | -2.170 | -7.504 |
R2 score for f1 | -0.129 | -2.091 | -8.940 |
MSE for perplexity_member | 1.221 | 1.890 | 5.380 |
MSE for perplexity_non_member | 19.629 | 0.233 | 18.723 |
MSE for roc_auc | 0.049 | 0.121 | 0.023 |
MSE for accuracy | 0.041 | 0.112 | 0.033 |
MSE for precision | 0.041 | 0.093 | 0.018 |
MSE for recall | 0.048 | 0.144 | 0.061 |
MSE for f1 | 0.042 | 0.116 | 0.036 |
MedAE for perplexity_member | 0.943 | 0.774 | 1.623 |
MedAE for perplexity_non_member | 3.481 | 0.332 | 3.001 |
MedAE for roc_auc | 0.165 | 0.320 | 0.135 |
MedAE for accuracy | 0.200 | 0.325 | 0.150 |
MedAE for precision | 0.164 | 0.291 | 0.109 |
MedAE for recall | 0.200 | 0.300 | 0.200 |
MedAE for f1 | 0.183 | 0.301 | 0.152 |
Coefficient of Variation for perplexity_member | 0.344 | 0.290 | 0.0 |
Coefficient of Variation for perplexity_non_member | 0.119 | 0.264 | 0.0 |
Coefficient of Variation for roc_auc | 0.301 | 0.057 | 1.166e-16 |
Coefficient of Variation for accuracy | 0.283 | 0.067 | 1.190e-16 |
Coefficient of Variation for precision | 0.272 | 0.053 | 1.244e-16 |
Coefficient of Variation for recall | 0.317 | 0.112 | 0.0 |
Coefficient of Variation for f1 | 0.294 | 0.079 | 0.0 |
MAPE for perplexity_member | 0.397 | 0.293 | 0.498 |
MAPE for perplexity_non_member | 1.040 | 0.084 | 0.453 |
MAPE for roc_auc | 0.352 | 0.537 | 0.178 |
MAPE for accuracy | 0.318 | 0.520 | 0.229 |
MAPE for precision | 0.327 | 0.476 | 0.168 |
MAPE for recall | 0.331 | 0.621 | 0.321 |
MAPE for f1 | 0.327 | 0.545 | 0.241 |
Wilcoxon test for perplexity_member | 0.0, | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 0.0, | 43.0, | 0.0, |
Wilcoxon test for roc_auc | 61.0, | 0.0, | 0.0, |
Wilcoxon test for accuracy | 64.5, | 0.0, | 0.0, |
Wilcoxon test for precision | 57.0, | 0.0, | 0.0, |
Wilcoxon test for recall | 79.0, | 0.0, | 0.0, |
Wilcoxon test for f1 | 78.0, | 0.0, | 0.0, |
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 1.602 | 1.833 | 3.032 |
RMSE for perplexity_non_member | 3.979 | 1.067 | 4.467 |
RMSE for roc_auc | 0.172 | 0.280 | 0.138 |
RMSE for accuracy | 0.150 | 0.240 | 0.138 |
RMSE for precision | 0.175 | 0.245 | 0.122 |
RMSE for recall | 0.138 | 0.249 | 0.195 |
RMSE for f1 | 0.147 | 0.247 | 0.160 |
Pearson correlation for perplexity_member | 0.885 | nan | nan |
Pearson correlation for perplexity_non_member | 0.688 | nan | nan |
Pearson correlation for roc_auc | 0.335 | nan | nan |
Pearson correlation for accuracy | 0.157 | nan | nan |
Pearson correlation for precision | 0.030 | nan | nan |
Pearson correlation for recall | 0.290 | nan | nan |
Pearson correlation for f1 | 0.203 | nan | nan |
MAE for perplexity_member | 1.493 | 1.228 | 2.685 |
MAE for perplexity_non_member | 3.649 | 0.482 | 3.939 |
MAE for roc_auc | 0.155 | 0.243 | 0.133 |
MAE for accuracy | 0.107 | 0.214 | 0.121 |
MAE for precision | 0.128 | 0.216 | 0.100 |
MAE for recall | 0.089 | 0.214 | 0.175 |
MAE for f1 | 0.095 | 0.216 | 0.141 |
R2 score for perplexity_member | -0.386 | -0.815 | -3.644 |
R2 score for perplexity_non_member | -14.029 | -0.080 | -3.499 |
R2 score for roc_auc | -0.511 | -3.026 | -14.274 |
R2 score for accuracy | -0.913 | -3.905 | -3.380 |
R2 score for precision | -1.337 | -3.545 | -2.143 |
R2 score for recall | -0.167 | -2.830 | -4.035 |
R2 score for f1 | -0.547 | -3.334 | -3.526 |
MSE for perplexity_member | 2.565 | 3.359 | 9.190 |
MSE for perplexity_non_member | 15.831 | 1.138 | 19.954 |
MSE for roc_auc | 0.029 | 0.078 | 0.019 |
MSE for accuracy | 0.022 | 0.058 | 0.019 |
MSE for precision | 0.031 | 0.060 | 0.015 |
MSE for recall | 0.019 | 0.062 | 0.038 |
MSE for f1 | 0.022 | 0.061 | 0.026 |
MedAE for perplexity_member | 1.279 | 0.839 | 2.118 |
MedAE for perplexity_non_member | 3.251 | 0.187 | 3.027 |
MedAE for roc_auc | 0.155 | 0.310 | 0.135 |
MedAE for accuracy | 0.050 | 0.250 | 0.150 |
MedAE for precision | 0.083 | 0.244 | 0.117 |
MedAE for recall | 0.100 | 0.200 | 0.200 |
MedAE for f1 | 0.058 | 0.229 | 0.168 |
Coefficient of Variation for perplexity_member | 0.412 | 0.297 | 2.123e-16 |
Coefficient of Variation for perplexity_non_member | 0.259 | 0.279 | 1.206e-16 |
Coefficient of Variation for roc_auc | 0.238 | 0.051 | 0.0 |
Coefficient of Variation for accuracy | 0.189 | 0.099 | 1.413e-16 |
Coefficient of Variation for precision | 0.200 | 0.100 | 1.413e-16 |
Coefficient of Variation for recall | 0.221 | 0.142 | 1.413e-16 |
Coefficient of Variation for f1 | 0.207 | 0.116 | 1.413e-16 |
MAPE for perplexity_member | 0.480 | 0.290 | 0.522 |
MAPE for perplexity_non_member | 0.914 | 0.089 | 0.477 |
MAPE for roc_auc | 0.294 | 0.476 | 0.191 |
MAPE for accuracy | 0.221 | 0.417 | 0.190 |
MAPE for precision | 0.269 | 0.429 | 0.155 |
MAPE for recall | 0.203 | 0.453 | 0.305 |
MAPE for f1 | 0.213 | 0.441 | 0.230 |
Wilcoxon test for perplexity_member | 1.0, | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 0.0, | 192.0, | 0.0, |
Wilcoxon test for roc_auc | 41.5, | 0.0, | 0.0, |
Wilcoxon test for accuracy | 20.5, | 0.0, | 0.0, |
Wilcoxon test for precision | 14.0, | 0.0, | 0.0, |
Wilcoxon test for recall | 42.0, | 0.0, | 0.0, |
Wilcoxon test for f1 | 48.0, | 0.0, | 0.0, |
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 3.197 | 7.225 | 6.567 |
RMSE for perplexity_non_member | 3.577 | 6.896 | 7.528 |
RMSE for roc_auc | 0.081 | 0.025 | 0.073 |
RMSE for accuracy | 0.081 | 0.053 | 0.070 |
RMSE for precision | 0.088 | 0.053 | 0.073 |
RMSE for recall | 0.115 | 0.080 | 0.145 |
RMSE for f1 | 0.090 | 0.061 | 0.099 |
Pearson correlation for perplexity_member | 0.896 | nan | nan |
Pearson correlation for perplexity_non_member | 0.887 | nan | nan |
Pearson correlation for roc_auc | 0.003 | nan | nan |
Pearson correlation for accuracy | 0.154 | nan | nan |
Pearson correlation for precision | 0.073 | nan | nan |
Pearson correlation for recall | 0.273 | nan | nan |
Pearson correlation for f1 | 0.283 | nan | nan |
MAE for perplexity_member | 2.573 | 4.447 | 5.560 |
MAE for perplexity_non_member | 3.169 | 4.076 | 6.378 |
MAE for roc_auc | 0.070 | 0.016 | 0.062 |
MAE for accuracy | 0.059 | 0.039 | 0.055 |
MAE for precision | 0.067 | 0.041 | 0.061 |
MAE for recall | 0.082 | 0.050 | 0.118 |
MAE for f1 | 0.066 | 0.044 | 0.081 |
R2 score for perplexity_member | 0.685 | -0.610 | -2.532 |
R2 score for perplexity_non_member | 0.587 | -0.537 | -2.543 |
R2 score for roc_auc | -10.510 | -0.098 | -0.456 |
R2 score for accuracy | -1.291 | -0.004 | -0.123 |
R2 score for precision | -1.680 | -0.000 | -0.039 |
R2 score for recall | -2.364 | -0.636 | -1.391 |
R2 score for f1 | -1.601 | -0.184 | -0.769 |
MSE for perplexity_member | 10.222 | 52.200 | 43.127 |
MSE for perplexity_non_member | 12.792 | 47.557 | 56.675 |
MSE for roc_auc | 0.007 | 0.001 | 0.005 |
MSE for accuracy | 0.007 | 0.003 | 0.005 |
MSE for precision | 0.008 | 0.003 | 0.005 |
MSE for recall | 0.013 | 0.006 | 0.021 |
MSE for f1 | 0.008 | 0.004 | 0.010 |
MedAE for perplexity_member | 2.284 | 1.442 | 3.925 |
MedAE for perplexity_non_member | 2.885 | 1.023 | 4.564 |
MedAE for roc_auc | 0.065 | 0.010 | 0.050 |
MedAE for accuracy | 0.050 | 0.050 | 0.050 |
MedAE for precision | 0.055 | 0.045 | 0.050 |
MedAE for recall | 0.100 | 0.000 | 0.100 |
MedAE for f1 | 0.048 | 0.029 | 0.071 |
Coefficient of Variation for perplexity_member | 0.781 | 0.417 | 0.0 |
Coefficient of Variation for perplexity_non_member | 0.746 | 0.412 | 0.0 |
Coefficient of Variation for roc_auc | 0.050 | 0.115 | 0.0 |
Coefficient of Variation for accuracy | 0.099 | 0.128 | 2.056e-16 |
Coefficient of Variation for precision | 0.100 | 0.137 | 2.073e-16 |
Coefficient of Variation for recall | 0.116 | 0.195 | 0.0 |
Coefficient of Variation for f1 | 0.104 | 0.151 | 0.0 |
MAPE for perplexity_member | 0.466 | 0.399 | 0.596 |
MAPE for perplexity_non_member | 0.570 | 0.337 | 0.589 |
MAPE for roc_auc | 0.145 | 0.034 | 0.112 |
MAPE for accuracy | 0.112 | 0.077 | 0.114 |
MAPE for precision | 0.127 | 0.080 | 0.123 |
MAPE for recall | 0.147 | 0.107 | 0.287 |
MAPE for f1 | 0.122 | 0.090 | 0.181 |
Wilcoxon test for perplexity_member | 77.0, | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 49.0, | 0.0, | 0.0, |
Wilcoxon test for roc_auc | 61.0, | 81.0, | 52.5, |
Wilcoxon test for accuracy | 51.0, | 55.0, | 42.0, |
Wilcoxon test for precision | 103.5, | 98.0, | 138.0, |
Wilcoxon test for recall | 19.5, | 0.0, | 7.0, |
Wilcoxon test for f1 | 57.5, | 55.0, | 34.0, |
Metric | DP vs DP-Weights | DP vs Fine-Tuned | DP-Weights vs Fine-Tuned |
RMSE for perplexity_member | 7.123 | 13.296 | 7.165 |
RMSE for perplexity_non_member | 6.595 | 13.011 | 7.575 |
RMSE for roc_auc | 0.079 | 0.033 | 0.076 |
RMSE for accuracy | 0.109 | 0.070 | 0.074 |
RMSE for precision | 0.122 | 0.072 | 0.088 |
RMSE for recall | 0.167 | 0.073 | 0.135 |
RMSE for f1 | 0.145 | 0.070 | 0.112 |
Pearson correlation for perplexity_member | 0.922 | nan | nan |
Pearson correlation for perplexity_non_member | 0.922 | nan | nan |
Pearson correlation for roc_auc | 0.018 | nan | nan |
Pearson correlation for accuracy | -0.100 | nan | nan |
Pearson correlation for precision | -0.082 | nan | nan |
Pearson correlation for recall | -0.229 | nan | nan |
Pearson correlation for f1 | -0.168 | nan | nan |
MAE for perplexity_member | 4.356 | 9.001 | 6.126 |
MAE for perplexity_non_member | 4.126 | 8.774 | 6.479 |
MAE for roc_auc | 0.061 | 0.030 | 0.060 |
MAE for accuracy | 0.093 | 0.063 | 0.059 |
MAE for precision | 0.103 | 0.065 | 0.069 |
MAE for recall | 0.136 | 0.054 | 0.111 |
MAE for f1 | 0.119 | 0.059 | 0.091 |
R2 score for perplexity_member | 0.470 | -0.846 | -2.717 |
R2 score for perplexity_non_member | 0.529 | -0.834 | -2.726 |
R2 score for roc_auc | -4.865 | -0.043 | -0.450 |
R2 score for accuracy | -1.853 | -0.171 | -0.030 |
R2 score for precision | -2.402 | -0.181 | -0.052 |
R2 score for recall | -4.212 | -0.002 | -0.447 |
R2 score for f1 | -3.573 | -0.048 | -0.268 |
MSE for perplexity_member | 50.742 | 176.777 | 51.335 |
MSE for perplexity_non_member | 43.490 | 169.299 | 57.378 |
MSE for roc_auc | 0.006 | 0.001 | 0.006 |
MSE for accuracy | 0.012 | 0.005 | 0.005 |
MSE for precision | 0.015 | 0.005 | 0.008 |
MSE for recall | 0.028 | 0.005 | 0.018 |
MSE for f1 | 0.021 | 0.005 | 0.012 |
MedAE for perplexity_member | 2.159 | 4.802 | 4.431 |
MedAE for perplexity_non_member | 2.356 | 4.613 | 4.718 |
MedAE for roc_auc | 0.045 | 0.030 | 0.065 |
MedAE for accuracy | 0.100 | 0.050 | 0.050 |
MedAE for precision | 0.100 | 0.056 | 0.056 |
MedAE for recall | 0.100 | 0.100 | 0.100 |
MedAE for f1 | 0.103 | 0.067 | 0.079 |
Coefficient of Variation for perplexity_member | 0.799 | 0.394 | 1.303e-16 |
Coefficient of Variation for perplexity_non_member | 0.783 | 0.392 | 1.215e-16 |
Coefficient of Variation for roc_auc | 0.068 | 0.123 | 1.178e-16 |
Coefficient of Variation for accuracy | 0.125 | 0.152 | 0.0 |
Coefficient of Variation for precision | 0.127 | 0.181 | 0.0 |
Coefficient of Variation for recall | 0.148 | 0.269 | 0.0 |
Coefficient of Variation for f1 | 0.134 | 0.225 | 0.0 |
MAPE for perplexity_member | 0.327 | 0.527 | 0.587 |
MAPE for perplexity_non_member | 0.329 | 0.506 | 0.583 |
MAPE for roc_auc | 0.131 | 0.061 | 0.111 |
MAPE for accuracy | 0.176 | 0.119 | 0.125 |
MAPE for precision | 0.194 | 0.123 | 0.157 |
MAPE for recall | 0.261 | 0.110 | 0.339 |
MAPE for f1 | 0.225 | 0.116 | 0.244 |
Wilcoxon test for perplexity_member | 149.0, | 0.0, | 0.0, |
Wilcoxon test for perplexity_non_member | 159.0, | 0.0, | 0.0, |
Wilcoxon test for roc_auc | 94.5, | 126.0, | 62.5, |
Wilcoxon test for accuracy | 91.5, | 62.0, | 103.5, |
Wilcoxon test for precision | 90.5, | 80.0, | 89.0, |
Wilcoxon test for recall | 49.0, | 56.0, | 45.0, |
Wilcoxon test for f1 | 79.5, | 108.0, | 82.0, |