Search | arXiv e-print repository

CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models

Authors: Manish Bhatt, Sahana Chennabasappa, Yue Li, Cyrus Nikolaidis, Daniel Song, Shengye Wan, Faizan Ahmad, Cornelius Aschermann, Yaohui Chen, Dhaval Kapil, David Molnar, Spencer Whitman, Joshua Saxe

Abstract: Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We introduce two new areas for testing: prompt injection and code interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral,… ▽ More Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We introduce two new areas for testing: prompt injection and code interpreter abuse. We evaluated multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama. Our results show that conditioning away risk of attack remains an unsolved problem; for example, all tested models showed between 26% and 41% successful prompt injection tests. We further introduce the safety-utility tradeoff: conditioning an LLM to reject unsafe prompts can cause the LLM to falsely reject answering benign prompts, which lowers utility. We propose quantifying this tradeoff using False Refusal Rate (FRR). As an illustration, we introduce a novel test set to quantify FRR for cyberattack helpfulness risk. We find many LLMs able to successfully comply with "borderline" benign requests while still rejecting most unsafe requests. Finally, we quantify the utility of LLMs for automating a core cybersecurity task, that of exploiting software vulnerabilities. This is important because the offensive capabilities of LLMs are of intense interest; we quantify this by creating novel test sets for four representative problems. We find that models with coding capabilities perform better than those without, but that further work is needed for LLMs to become proficient at exploit generation. Our code is open source and can be used to evaluate other LLMs. △ Less

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2312.04724 [pdf, other]

Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

Authors: Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, Joshua Saxe

Abstract: This paper presents CyberSecEval, a comprehensive benchmark developed to help bolster the cybersecurity of Large Language Models (LLMs) employed as coding assistants. As what we believe to be the most extensive unified cybersecurity safety benchmark to date, CyberSecEval provides a thorough evaluation of LLMs in two crucial security domains: their propensity to generate insecure code and their lev… ▽ More This paper presents CyberSecEval, a comprehensive benchmark developed to help bolster the cybersecurity of Large Language Models (LLMs) employed as coding assistants. As what we believe to be the most extensive unified cybersecurity safety benchmark to date, CyberSecEval provides a thorough evaluation of LLMs in two crucial security domains: their propensity to generate insecure code and their level of compliance when asked to assist in cyberattacks. Through a case study involving seven models from the Llama 2, Code Llama, and OpenAI GPT large language model families, CyberSecEval effectively pinpointed key cybersecurity risks. More importantly, it offered practical insights for refining these models. A significant observation from the study was the tendency of more advanced models to suggest insecure code, highlighting the critical need for integrating security considerations in the development of sophisticated LLMs. CyberSecEval, with its automated test case generation and evaluation pipeline covers a broad scope and equips LLM designers and researchers with a tool to broadly measure and enhance the cybersecurity safety properties of LLMs, contributing to the development of more secure AI systems. △ Less

Submitted 7 December, 2023; originally announced December 2023.

arXiv:1903.00503 [pdf, other]

Automatic Techniques to Systematically Discover New Heap Exploitation Primitives

Authors: Insu Yun, Dhaval Kapil, Taesoo Kim

Abstract: Heap exploitation techniques to abuse the metadata of allocators have been widely studied since they are application independent and can be used in restricted environments that corrupt only metadata. Although prior work has found several interesting exploitation techniques, they are ad-hoc and manual, which cannot effectively handle changes or a variety of allocators. In this paper, we present a… ▽ More Heap exploitation techniques to abuse the metadata of allocators have been widely studied since they are application independent and can be used in restricted environments that corrupt only metadata. Although prior work has found several interesting exploitation techniques, they are ad-hoc and manual, which cannot effectively handle changes or a variety of allocators. In this paper, we present a new naming scheme for heap exploitation techniques that systematically organizes them to discover the unexplored space in finding the techniques and ArcHeap, the tool that finds heap exploitation techniques automatically and systematically regardless of their underlying implementations. For that, ArcHeap generates a set of heap actions (e.g. allocation or deallocation) by leveraging fuzzing, which exploits common designs of modern heap allocators. Then, ArcHeap checks whether the actions result in impact of exploitations such as arbitrary write or overlapped chunks that efficiently determine if the actions can be converted into the exploitation technique. Finally, from these actions, ArcHeap generates Proof-of-Concept code automatically for an exploitation technique. We evaluated ArcHeap with real-world allocators --- ptmalloc, jemalloc, and tcmalloc --- and custom allocators from the DARPA Cyber Grand Challenge. ArcHeap successfully found 14 out of 16 known exploitation techniques and found five new exploitation techniques in ptmalloc. Moreover, ArcHeap found several exploitation techniques for jemalloc, tcmalloc, and even for the custom allocators. Further, ArcHeap can automatically show changes in exploitation techniques along with version change in ptmalloc using differential testing. △ Less

Submitted 1 March, 2019; originally announced March 2019.

Showing 1–3 of 3 results for author: Kapil, D