
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Authors: Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer , et al. (32 additional authors not shown)

Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

Submitted 15 May, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: See the project page at https://wmdp.ai

arXiv:2306.07544 [pdf, other]

On Achieving Optimal Adversarial Test Error

Authors: Justin D. Li, Matus Telgarsky

Abstract: We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.

Submitted 28 April, 2024; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: ICLR 2023; bugs fixed

arXiv:2106.05932 [pdf, other]

Early-stopped neural networks are consistent

Authors: Ziwei Ji, Justin D. Li, Matus Telgarsky

Abstract: This work studies the behavior of shallow ReLU networks trained with the logistic loss via gradient descent on binary classification data where the underlying data distribution is general, and the (optimal) Bayes risk is not necessarily zero. In this setting, it is shown that gradient descent with early stopping achieves population risk arbitrarily close to optimal in terms of not just logistic and misclassification losses, but also in terms of calibration, meaning the sigmoid mapping of its outputs approximates the true underlying conditional distribution arbitrarily finely. Moreover, the necessary iteration, sample, and architectural complexities of this analysis all scale naturally with a certain complexity measure of the true conditional model. Lastly, while it is not shown that early stopping is necessary, it is shown that any univariate classifier satisfying a local interpolation property is inconsistent.

Submitted 4 November, 2021; v1 submitted 10 June, 2021; originally announced June 2021.

arXiv:1702.04272 [pdf, other]

doi 10.1016/j.jvolgeores.2016.10.003

Computer Aided Detection of Transient Inflation Events at Alaskan Volcanoes using GPS Measurements from 2005-2015

Authors: Justin D Li, Cody M Rude, David M Blair, Michael G Gowanlock, Thomas A Herring, Victor Pankratius

Abstract: Analysis of transient deformation events in time series data observed via networks of continuous Global Positioning System (GPS) ground stations provide insight into the magmatic and tectonic processes that drive volcanic activity. Typical analyses of spatial positions originating from each station require careful tuning of algorithmic parameters and selection of time and spatial regions of interest to observe possible transient events. This iterative, manual process is tedious when attempting to make new discoveries and does not easily scale with the number of stations. Addressing this challenge, we introduce a novel approach based on a computer-aided discovery system that facilitates the discovery of such potential transient events. The advantages of this approach are demonstrated by actual detections of transient deformation events at volcanoes selected from the Alaska Volcano Observatory database using data recorded by GPS stations from the Plate Boundary Observatory network. Our technique successfully reproduces the analysis of a transient signal detected in the first half of 2008 at Akutan volcano and is also directly applicable to 3 additional volcanoes in Alaska, with the new detection of 2 previously unnoticed inflation events: in early 2011 at Westdahl and in early 2013 at Shishaldin. This study also discusses the benefits of our computer-aided discovery approach for volcanology in general. Advantages include the rapid analysis on multi-scale resolutions of transient deformation events at a large number of sites of interest and the capability to enhance reusability and reproducibility in volcano studies.

Submitted 14 February, 2017; originally announced February 2017.

Comments: Published in the Journal of Volcanology and Geothermal Research. 9 pages, 7 figures

Journal ref: Journal of Volcanology and Geothermal Research, 327, 634-642

Showing 1–4 of 4 results for author: Li, J D