Search | arXiv e-print repository

Evaluating LLMs for Hardware Design and Test

Authors: Jason Blocklove, Siddharth Garg, Ramesh Karri, Hammond Pearce

Abstract: Large Language Models (LLMs) have demonstrated capabilities for producing code in Hardware Description Languages (HDLs). However, most of the focus remains on their abilities to write functional code, not test code. The hardware design process consists of both design and test, and so eschewing validation and verification leaves considerable potential benefit unexplored, given that a design and tes… ▽ More Large Language Models (LLMs) have demonstrated capabilities for producing code in Hardware Description Languages (HDLs). However, most of the focus remains on their abilities to write functional code, not test code. The hardware design process consists of both design and test, and so eschewing validation and verification leaves considerable potential benefit unexplored, given that a design and test framework may allow for progress towards full automation of the digital design pipeline. In this work, we perform one of the first studies exploring how a LLM can both design and test hardware modules from provided specifications. Using a suite of 8 representative benchmarks, we examined the capabilities and limitations of the state-of-the-art conversational LLMs when producing Verilog for functional and verification purposes. We taped out the benchmarks on a Skywater 130nm shuttle and received the functional chip. △ Less

Submitted 23 April, 2024; originally announced May 2024.

arXiv:2404.15446 [pdf, other]

OffRAMPS: An FPGA-based Intermediary for Analysis and Modification of Additive Manufacturing Control Systems

Authors: Jason Blocklove, Md Raz, Prithwish Basu Roy, Hammond Pearce, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri

Abstract: Cybersecurity threats in Additive Manufacturing (AM) are an increasing concern as AM adoption continues to grow. AM is now being used for parts in the aerospace, transportation, and medical domains. Threat vectors which allow for part compromise are particularly concerning, as any failure in these domains would have life-threatening consequences. A major challenge to investigation of AM part-compr… ▽ More Cybersecurity threats in Additive Manufacturing (AM) are an increasing concern as AM adoption continues to grow. AM is now being used for parts in the aerospace, transportation, and medical domains. Threat vectors which allow for part compromise are particularly concerning, as any failure in these domains would have life-threatening consequences. A major challenge to investigation of AM part-compromises comes from the difficulty in evaluating and benchmarking both identified threat vectors as well as methods for detecting adversarial actions. In this work, we introduce a generalized platform for systematic analysis of attacks against and defenses for 3D printers. Our "OFFRAMPS" platform is based on the open-source 3D printer control board "RAMPS." OFFRAMPS allows analysis, recording, and modification of all control signals and I/O for a 3D printer. We show the efficacy of OFFRAMPS by presenting a series of case studies based on several Trojans, including ones identified in the literature, and show that OFFRAMPS can both emulate and detect these attacks, i.e., it can both change and detect arbitrary changes to the g-code print commands. △ Less

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.07235 [pdf, other]

Explaining EDA synthesis errors with LLMs

Authors: Siyu Qiu, Benjamin Tan, Hammond Pearce

Abstract: Training new engineers in digital design is a challenge, particularly when it comes to teaching the complex electronic design automation (EDA) tooling used in this domain. Learners will typically deploy designs in the Verilog and VHDL hardware description languages to Field Programmable Gate Arrays (FPGAs) from Altera (Intel) and Xilinx (AMD) via proprietary closed-source toolchains (Quartus Prime… ▽ More Training new engineers in digital design is a challenge, particularly when it comes to teaching the complex electronic design automation (EDA) tooling used in this domain. Learners will typically deploy designs in the Verilog and VHDL hardware description languages to Field Programmable Gate Arrays (FPGAs) from Altera (Intel) and Xilinx (AMD) via proprietary closed-source toolchains (Quartus Prime and Vivado, respectively). These tools are complex and difficult to use -- yet, as they are the tools used in industry, they are an essential first step in this space. In this work, we examine how recent advances in artificial intelligence may be leveraged to address aspects of this challenge. Specifically, we investigate if Large Language Models (LLMs), which have demonstrated text comprehension and question-answering capabilities, can be used to generate novice-friendly explanations of compile-time synthesis error messages from Quartus Prime and Vivado. To perform this study we generate 936 error message explanations using three OpenAI LLMs over 21 different buggy code samples. These are then graded for relevance and correctness, and we find that in approximately 71% of cases the LLMs give correct & complete explanations suitable for novice learners. △ Less

Submitted 7 April, 2024; originally announced April 2024.

Comments: 6 pages, 6 figures

arXiv:2312.12575 [pdf, other]

LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

Authors: Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, Gianluca Stringhini

Abstract: Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set… ▽ More Large Language Models (LLMs) have been suggested for use in automated vulnerability repair, but benchmarks showing they can consistently identify security-related bugs are lacking. We thus develop SecLLMHolmes, a fully automated evaluation framework that performs the most detailed investigation to date on whether LLMs can reliably identify and reason about security-related bugs. We construct a set of 228 code scenarios and analyze eight of the most capable LLMs across eight different investigative dimensions using our framework. Our evaluation shows LLMs provide non-deterministic responses, incorrect and unfaithful reasoning, and perform poorly in real-world scenarios. Most importantly, our findings reveal significant non-robustness in even the most advanced models like `PaLM2' and `GPT-4': by merely changing function or variable names, or by the addition of library functions in the source code, these models can yield incorrect answers in 26% and 17% of cases, respectively. These findings demonstrate that further LLM advances are needed before LLMs can be used as general purpose security assistants. △ Less

Submitted 13 April, 2024; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Accepted for publication in IEEE Symposium on Security and Privacy 2024

arXiv:2311.04887 [pdf, other]

AutoChip: Automating HDL Generation Using LLM Feedback

Authors: Shailja Thakur, Jason Blocklove, Hammond Pearce, Benjamin Tan, Siddharth Garg, Ramesh Karri

Abstract: Traditionally, designs are written in Verilog hardware description language (HDL) and debugged by hardware engineers. While this approach is effective, it is time-consuming and error-prone for complex designs. Large language models (LLMs) are promising in automating HDL code generation. LLMs are trained on massive datasets of text and code, and they can learn to generate code that compiles and is… ▽ More Traditionally, designs are written in Verilog hardware description language (HDL) and debugged by hardware engineers. While this approach is effective, it is time-consuming and error-prone for complex designs. Large language models (LLMs) are promising in automating HDL code generation. LLMs are trained on massive datasets of text and code, and they can learn to generate code that compiles and is functionally accurate. We aim to evaluate the ability of LLMs to generate functionally correct HDL models. We build AutoChip by combining the interactive capabilities of LLMs and the output from Verilog simulations to generate Verilog modules. We start with a design prompt for a module and the context from compilation errors and debugging messages, which highlight differences between the expected and actual outputs. This ensures that accurate Verilog code can be generated without human intervention. We evaluate AutoChip using problem sets from HDLBits. We conduct a comprehensive analysis of the AutoChip using several LLMs and problem categories. The results show that incorporating context from compiler tools, such as Icarus Verilog, improves the effectiveness, yielding 24.20% more accurate Verilog. We release our evaluation scripts and datasets as open-source contributions at the following link https://github.com/shailja-thakur/AutoChip. △ Less

Submitted 4 June, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

arXiv:2310.10560 [pdf, other]

Towards the Imagenets of ML4EDA

Authors: Animesh Basak Chowdhury, Shailja Thakur, Hammond Pearce, Ramesh Karri, Siddharth Garg

Abstract: Despite the growing interest in ML-guided EDA tools from RTL to GDSII, there are no standard datasets or prototypical learning tasks defined for the EDA problem domain. Experience from the computer vision community suggests that such datasets are crucial to spur further progress in ML for EDA. Here we describe our experience curating two large-scale, high-quality datasets for Verilog code generati… ▽ More Despite the growing interest in ML-guided EDA tools from RTL to GDSII, there are no standard datasets or prototypical learning tasks defined for the EDA problem domain. Experience from the computer vision community suggests that such datasets are crucial to spur further progress in ML for EDA. Here we describe our experience curating two large-scale, high-quality datasets for Verilog code generation and logic synthesis. The first, VeriGen, is a dataset of Verilog code collected from GitHub and Verilog textbooks. The second, OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic synthesis tasks. The dataset consists of 870,000 And-Inverter-Graphs (AIGs) produced from 1500 synthesis runs on a large number of open-source hardware projects. In this paper we will discuss challenges in curating, maintaining and growing the size and scale of these datasets. We will also touch upon questions of dataset quality and security, and the use of novel data augmentation tools that are tailored for the hardware domain. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: Invited paper, ICCAD 2023

Report number: October 16 Update

Journal ref: ICCAD 2023

arXiv:2310.05135 [pdf, other]

Are Emily and Greg Still More Employable than Lakisha and Jamal? Investigating Algorithmic Hiring Bias in the Era of ChatGPT

Authors: Akshaj Kumar Veldanda, Fabian Grob, Shailja Thakur, Hammond Pearce, Benjamin Tan, Ramesh Karri, Siddharth Garg

Abstract: Large Language Models (LLMs) such as GPT-3.5, Bard, and Claude exhibit applicability across numerous tasks. One domain of interest is their use in algorithmic hiring, specifically in matching resumes with job categories. Yet, this introduces issues of bias on protected attributes like gender, race and maternity status. The seminal work of Bertrand & Mullainathan (2003) set the gold-standard for id… ▽ More Large Language Models (LLMs) such as GPT-3.5, Bard, and Claude exhibit applicability across numerous tasks. One domain of interest is their use in algorithmic hiring, specifically in matching resumes with job categories. Yet, this introduces issues of bias on protected attributes like gender, race and maternity status. The seminal work of Bertrand & Mullainathan (2003) set the gold-standard for identifying hiring bias via field experiments where the response rate for identical resumes that differ only in protected attributes, e.g., racially suggestive names such as Emily or Lakisha, is compared. We replicate this experiment on state-of-art LLMs (GPT-3.5, Bard, Claude and Llama) to evaluate bias (or lack thereof) on gender, race, maternity status, pregnancy status, and political affiliation. We evaluate LLMs on two tasks: (1) matching resumes to job categories; and (2) summarizing resumes with employment relevant information. Overall, LLMs are robust across race and gender. They differ in their performance on pregnancy status and political affiliation. We use contrastive input decoding on open-source LLMs to uncover potential sources of bias. △ Less

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2308.11873 [pdf, other]

Dcc --help: Generating Context-Aware Compiler Error Explanations with Large Language Models

Authors: Andrew Taylor, Alexandra Vassar, Jake Renzella, Hammond Pearce

Abstract: In the challenging field of introductory programming, high enrollments and failure rates drive us to explore tools and systems to enhance student outcomes, especially automated tools that scale to large cohorts. This paper presents and evaluates the dcc --help tool, an integration of a Large Language Model (LLM) into the Debugging C Compiler (DCC) to generate unique, novice-focused explanations ta… ▽ More In the challenging field of introductory programming, high enrollments and failure rates drive us to explore tools and systems to enhance student outcomes, especially automated tools that scale to large cohorts. This paper presents and evaluates the dcc --help tool, an integration of a Large Language Model (LLM) into the Debugging C Compiler (DCC) to generate unique, novice-focused explanations tailored to each error. dcc --help prompts an LLM with contextual information of compile- and run-time error occurrences, including the source code, error location and standard compiler error message. The LLM is instructed to generate novice-focused, actionable error explanations and guidance, designed to help students understand and resolve problems without providing solutions. dcc --help was deployed to our CS1 and CS2 courses, with 2,565 students using the tool over 64,000 times in ten weeks. We analysed a subset of these error/explanation pairs to evaluate their properties, including conceptual correctness, relevancy, and overall quality. We found that the LLM-generated explanations were conceptually accurate in 90% of compile-time and 75% of run-time cases, but often disregarded the instruction not to provide solutions in code. Our findings, observations and reflections following deployment indicate that dcc-help provides novel opportunities for scaffolding students' introduction to programming. △ Less

Submitted 15 October, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: 7 pages, 2 figures. Accepted in SIGCSE'24

arXiv:2308.00708 [pdf, other]

VeriGen: A Large Language Model for Verilog Code Generation

Authors: Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, Siddharth Garg

Abstract: In this study, we explore the capability of Large Language Models (LLMs) to automate hardware design by generating high-quality Verilog code, a common language for designing and modeling digital systems. We fine-tune pre-existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks. We evaluate the functional correctness of the generated Verilog code using a specially designed test… ▽ More In this study, we explore the capability of Large Language Models (LLMs) to automate hardware design by generating high-quality Verilog code, a common language for designing and modeling digital systems. We fine-tune pre-existing LLMs on Verilog datasets compiled from GitHub and Verilog textbooks. We evaluate the functional correctness of the generated Verilog code using a specially designed test suite, featuring a custom problem set and testing benches. Here, our fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall increase. Upon testing with a more diverse and complex problem set, we find that the fine-tuned model shows competitive performance against state-of-the-art gpt-3.5-turbo, excelling in certain scenarios. Notably, it demonstrates a 41% improvement in generating syntactically correct Verilog code across various problem categories compared to its pre-trained counterpart, highlighting the potential of smaller, in-house LLMs in hardware design automation. △ Less

Submitted 27 July, 2023; originally announced August 2023.

Comments: arXiv admin note: text overlap with arXiv:2212.11140

arXiv:2306.14027 [pdf, other]

LLM-assisted Generation of Hardware Assertions

Authors: Rahul Kande, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Shailja Thakur, Ramesh Karri, Jeyavijayan Rajendran

Abstract: The security of computer systems typically relies on a hardware root of trust. As vulnerabilities in hardware can have severe implications on a system, there is a need for techniques to support security verification activities. Assertion-based verification is a popular verification technique that involves capturing design intent in a set of assertions that can be used in formal verification or tes… ▽ More The security of computer systems typically relies on a hardware root of trust. As vulnerabilities in hardware can have severe implications on a system, there is a need for techniques to support security verification activities. Assertion-based verification is a popular verification technique that involves capturing design intent in a set of assertions that can be used in formal verification or testing-based checking. However, writing security-centric assertions is a challenging task. In this work, we investigate the use of emerging large language models (LLMs) for code generation in hardware assertion generation for security, where primarily natural language prompts, such as those one would see as code comments in assertion files, are used to produce SystemVerilog assertions. We focus our attention on a popular LLM and characterize its ability to write assertions out of the box, given varying levels of detail in the prompt. We design an evaluation framework that generates a variety of prompts, and we create a benchmark suite comprising real-world hardware designs and corresponding golden reference assertions that we want to generate with the LLM. △ Less

Submitted 24 June, 2023; originally announced June 2023.

arXiv:2306.12643 [pdf, other]

FLAG: Finding Line Anomalies (in code) with Generative AI

Authors: Baleegh Ahmad, Benjamin Tan, Ramesh Karri, Hammond Pearce

Abstract: Code contains security and functional bugs. The process of identifying and localizing them is difficult and relies on human labor. In this work, we present a novel approach (FLAG) to assist human debuggers. FLAG is based on the lexical capabilities of generative AI, specifically, Large Language Models (LLMs). Here, we input a code file then extract and regenerate each line within that file for sel… ▽ More Code contains security and functional bugs. The process of identifying and localizing them is difficult and relies on human labor. In this work, we present a novel approach (FLAG) to assist human debuggers. FLAG is based on the lexical capabilities of generative AI, specifically, Large Language Models (LLMs). Here, we input a code file then extract and regenerate each line within that file for self-comparison. By comparing the original code with an LLM-generated alternative, we can flag notable differences as anomalies for further inspection, with features such as distance from comments and LLM confidence also aiding this classification. This reduces the inspection search space for the designer. Unlike other automated approaches in this area, FLAG is language-agnostic, can work on incomplete (and even non-compiling) code and requires no creation of security properties, functional tests or definition of rules. In this work, we explore the features that help LLMs in this classification and evaluate the performance of FLAG on known bugs. We use 121 benchmarks across C, Python and Verilog; with each benchmark containing a known security or functional weakness. We conduct the experiments using two state of the art LLMs in OpenAI's code-davinci-002 and gpt-3.5-turbo, but our approach may be used by other models. FLAG can identify 101 of the defects and helps reduce the search space to 12-17% of source code. △ Less

Submitted 21 June, 2023; originally announced June 2023.

arXiv:2305.13243 [pdf, other]

doi 10.1109/MLCAD58807.2023.10299874

Chip-Chat: Challenges and Opportunities in Conversational Hardware Design

Authors: Jason Blocklove, Siddharth Garg, Ramesh Karri, Hammond Pearce

Abstract: Modern hardware design starts with specifications provided in natural language. These are then translated by hardware engineers into appropriate Hardware Description Languages (HDLs) such as Verilog before synthesizing circuit elements. Automating this translation could reduce sources of human error from the engineering process. But, it is only recently that artificial intelligence (AI) has demons… ▽ More Modern hardware design starts with specifications provided in natural language. These are then translated by hardware engineers into appropriate Hardware Description Languages (HDLs) such as Verilog before synthesizing circuit elements. Automating this translation could reduce sources of human error from the engineering process. But, it is only recently that artificial intelligence (AI) has demonstrated capabilities for machine-based end-to-end design translations. Commercially-available instruction-tuned Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Bard claim to be able to produce code in a variety of programming languages; but studies examining them for hardware are still lacking. In this work, we thus explore the challenges faced and opportunities presented when leveraging these recent advances in LLMs for hardware design. Given that these `conversational' LLMs perform best when used interactively, we perform a case study where a hardware engineer co-architects a novel 8-bit accumulator-based microprocessor architecture with the LLM according to real-world hardware constraints. We then sent the processor to tapeout in a Skywater 130nm shuttle, meaning that this `Chip-Chat' resulted in what we believe to be the world's first wholly-AI-written HDL for tapeout. △ Less

Submitted 14 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: 6 pages, 8 figures. Accepted in 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD)

arXiv:2305.06902 [pdf, other]

REMaQE: Reverse Engineering Math Equations from Executables

Authors: Meet Udeshi, Prashanth Krishnamurthy, Hammond Pearce, Ramesh Karri, Farshad Khorrami

Abstract: Cybersecurity attacks on embedded devices for industrial control systems and cyber-physical systems may cause catastrophic physical damage as well as economic loss. This could be achieved by infecting device binaries with malware that modifies the physical characteristics of the system operation. Mitigating such attacks benefits from reverse engineering tools that recover sufficient semantic knowl… ▽ More Cybersecurity attacks on embedded devices for industrial control systems and cyber-physical systems may cause catastrophic physical damage as well as economic loss. This could be achieved by infecting device binaries with malware that modifies the physical characteristics of the system operation. Mitigating such attacks benefits from reverse engineering tools that recover sufficient semantic knowledge in terms of mathematical equations of the implemented algorithm. Conventional reverse engineering tools can decompile binaries to low-level code, but offer little semantic insight. This paper proposes the REMaQE automated framework for reverse engineering of math equations from binary executables. Improving over state-of-the-art, REMaQE handles equation parameters accessed via registers, the stack, global memory, or pointers, and can reverse engineer object-oriented implementations such as C++ classes. Using REMaQE, we discovered a bug in the Linux kernel thermal monitoring tool "tmon". To evaluate REMaQE, we generate a dataset of 25,096 binaries with math equations implemented in C and Simulink. REMaQE successfully recovers a semantically matching equation for all 25,096 binaries. REMaQE executes in 0.48 seconds on average and in up to 2 seconds for complex equations. Real-time execution enables integration in an interactive math-oriented reverse engineering workflow. △ Less

Submitted 11 April, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

ACM Class: C.3; D.2.5

arXiv:2302.01215 [pdf, other]

doi 10.1109/TIFS.2024.3374558

Fixing Hardware Security Bugs with Large Language Models

Authors: Baleegh Ahmad, Shailja Thakur, Benjamin Tan, Ramesh Karri, Hammond Pearce

Abstract: Novel AI-based code-writing Large Language Models (LLMs) such as OpenAI's Codex have demonstrated capabilities in many coding-adjacent domains. In this work we consider how LLMs maybe leveraged to automatically repair security relevant bugs present in hardware designs. We focus on bug repair in code written in the Hardware Description Language Verilog. For this study we build a corpus of domain-re… ▽ More Novel AI-based code-writing Large Language Models (LLMs) such as OpenAI's Codex have demonstrated capabilities in many coding-adjacent domains. In this work we consider how LLMs maybe leveraged to automatically repair security relevant bugs present in hardware designs. We focus on bug repair in code written in the Hardware Description Language Verilog. For this study we build a corpus of domain-representative hardware security bugs. We then design and implement a framework to quantitatively evaluate the performance of any LLM tasked with fixing the specified bugs. The framework supports design space exploration of prompts (i.e., prompt engineering) and identifying the best parameters for the LLM. We show that an ensemble of LLMs can repair all ten of our benchmarks. This ensemble outperforms the state-of-the-art Cirfix hardware bug repair tool on its own suite of bugs. These results show that LLMs can repair hardware security bugs and the framework is an important step towards the ultimate goal of an automated end-to-end bug repair framework. △ Less

Submitted 2 February, 2023; originally announced February 2023.

arXiv:2301.10336 [pdf, other]

A survey of Digital Manufacturing Hardware and Software Trojans

Authors: Prithwish Basu Roy, Mudit Bhargava, Chia-Yun Chang, Ellen Hui, Nikhil Gupta, Ramesh Karri, Hammond Pearce

Abstract: Digital Manufacturing (DM) refers to the on-going adoption of smarter, more agile manufacturing processes and cyber-physical systems. This includes modern techniques and technologies such as Additive Manufacturing (AM)/3D printing, as well as the Industrial Internet of Things (IIoT) and the broader trend toward Industry 4.0. However, this adoption is not without risks: with a growing complexity an… ▽ More Digital Manufacturing (DM) refers to the on-going adoption of smarter, more agile manufacturing processes and cyber-physical systems. This includes modern techniques and technologies such as Additive Manufacturing (AM)/3D printing, as well as the Industrial Internet of Things (IIoT) and the broader trend toward Industry 4.0. However, this adoption is not without risks: with a growing complexity and connectivity, so too grows the cyber-physical attack surface. Here, malicious actors might seek to steal sensitive information or sabotage products or production lines, causing financial and reputational loss. Of particular concern are where such malicious attacks may enter the complex supply chains of DM systems as Trojans -- malicious modifications that may trigger their payloads at later times or stages of the product lifecycle. In this work, we thus present a comprehensive overview of the threats posed by Trojans in Digital Manufacturing. We cover both hardware and software Trojans which may exist in products or their production and supply lines. From this, we produce a novel taxonomy for classifying and analyzing these threats, and elaborate on how different side channels (e.g. visual, thermal, acoustic, power, and magnetic) may be used to either enhance the impact of a given Trojan or utilized as part of a defensive strategy. Other defenses are also presented -- including hardware, web-, and software-related. To conclude, we discuss seven different case studies and elaborate how they fit into our taxonomy. Overall, this paper presents a detailed survey of the Trojan landscape for Digital Manufacturing: threats, defenses, and the importance of implementing secure practices. △ Less

Submitted 24 January, 2023; originally announced January 2023.

Comments: 15 pages

arXiv:2212.11140 [pdf, other]

Benchmarking Large Language Models for Automated Verilog RTL Code Generation

Authors: Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt, Siddharth Garg

Abstract: Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we c… ▽ More Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen. △ Less

Submitted 13 December, 2022; originally announced December 2022.

Comments: Accepted in DATE 2023. 7 pages, 4 tables, 7 figures

arXiv:2209.01291 [pdf, other]

doi 10.1145/3508352.3549369

Don't CWEAT It: Toward CWE Analysis Techniques in Early Stages of Hardware Design

Authors: Baleegh Ahmad, Wei-Kai Liu, Luca Collini, Hammond Pearce, Jason M. Fung, Jonathan Valamehr, Mohammad Bidmeshki, Piotr Sapiecha, Steve Brown, Krishnendu Chakrabarty, Ramesh Karri, Benjamin Tan

Abstract: To help prevent hardware security vulnerabilities from propagating to later design stages where fixes are costly, it is crucial to identify security concerns as early as possible, such as in RTL designs. In this work, we investigate the practical implications and feasibility of producing a set of security-specific scanners that operate on Verilog source files. The scanners indicate parts of code t… ▽ More To help prevent hardware security vulnerabilities from propagating to later design stages where fixes are costly, it is crucial to identify security concerns as early as possible, such as in RTL designs. In this work, we investigate the practical implications and feasibility of producing a set of security-specific scanners that operate on Verilog source files. The scanners indicate parts of code that might contain one of a set of MITRE's common weakness enumerations (CWEs). We explore the CWE database to characterize the scope and attributes of the CWEs and identify those that are amenable to static analysis. We prototype scanners and evaluate them on 11 open source designs - 4 system-on-chips (SoC) and 7 processor cores - and explore the nature of identified weaknesses. Our analysis reported 53 potential weaknesses in the OpenPiton SoC used in Hack@DAC-21, 11 of which we confirmed as security concerns. △ Less

Submitted 2 September, 2022; originally announced September 2022.

arXiv:2208.09727 [pdf, other]

Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants

Authors: Gustavo Sandoval, Hammond Pearce, Teo Nys, Ramesh Karri, Siddharth Garg, Brendan Dolan-Gavitt

Abstract: Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. Understanding the impact of these tools on developers' code is paramount, especially as recent work showed that LLMs may suggest cybersecurity vulnerabilities. We conduct a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Given the poten… ▽ More Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. Understanding the impact of these tools on developers' code is paramount, especially as recent work showed that LLMs may suggest cybersecurity vulnerabilities. We conduct a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Given the potential severity of low-level bugs as well as their relative frequency in real-world projects, we tasked participants with implementing a singly-linked 'shop** list' structure in C. Our results indicate that the security impact in this setting (low-level C with pointer and array manipulations) is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks. △ Less

Submitted 27 February, 2023; v1 submitted 20 August, 2022; originally announced August 2022.

Comments: Accepted for publication in USENIX'23. For associated dataset see https://doi.org/10.5281/zenodo.7187359. 18 pages, 12 figures. G. Sandoval and H. Pearce contributed equally to this work

arXiv:2207.10466 [pdf, other]

doi 10.1145/3577200

High-Level Approaches to Hardware Security: A Tutorial

Authors: Hammond Pearce, Ramesh Karri, Benjamin Tan

Abstract: Designers use third-party intellectual property (IP) cores and outsource various steps in the integrated circuit (IC) design and manufacturing flow. As a result, security vulnerabilities have been rising. This is forcing IC designers and end users to re-evaluate their trust in ICs. If attackers get hold of an unprotected IC, they can reverse engineer the IC and pirate the IP. Similarly, if attacke… ▽ More Designers use third-party intellectual property (IP) cores and outsource various steps in the integrated circuit (IC) design and manufacturing flow. As a result, security vulnerabilities have been rising. This is forcing IC designers and end users to re-evaluate their trust in ICs. If attackers get hold of an unprotected IC, they can reverse engineer the IC and pirate the IP. Similarly, if attackers get hold of a design, they can insert malicious circuits or take advantage of "backdoors" in a design. Unintended design bugs can also result in security weaknesses. This tutorial paper provides an introduction to the domain of hardware security through two pedagogical examples of hardware security problems. The first is a walk-through of the scan chain-based side channel attack. The second is a walk-through of logic locking of digital designs. The tutorial material is accompanied by open access digital resources that are linked in this article. △ Less

Submitted 6 March, 2023; v1 submitted 21 July, 2022; originally announced July 2022.

Comments: Accepted in IEEE TECS. 41 pages, 13 figures

arXiv:2202.01142 [pdf, other]

Pop Quiz! Can a Large Language Model Help With Reverse Engineering?

Authors: Hammond Pearce, Benjamin Tan, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Brendan Dolan-Gavitt

Abstract: Large language models (such as OpenAI's Codex) have demonstrated impressive zero-shot multi-task capabilities in the software domain, including code explanation. In this work, we examine if this ability can be used to help with reverse engineering. Specifically, we investigate prompting Codex to identify the purpose, capabilities, and important variable names or values from code, even when the cod… ▽ More Large language models (such as OpenAI's Codex) have demonstrated impressive zero-shot multi-task capabilities in the software domain, including code explanation. In this work, we examine if this ability can be used to help with reverse engineering. Specifically, we investigate prompting Codex to identify the purpose, capabilities, and important variable names or values from code, even when the code is produced through decompilation. Alongside an examination of the model's responses in answering open-ended questions, we devise a true/false quiz framework to characterize the performance of the language model. We present an extensive quantitative analysis of the measured performance of the language model on a set of program purpose identification and information extraction tasks: of the 136,260 questions we posed, it answered 72,754 correctly. A key takeaway is that while promising, LLMs are not yet ready for zero-shot reverse engineering. △ Less

Submitted 2 February, 2022; originally announced February 2022.

Comments: 18 pages, 19 figures. Linked dataset: https://doi.org/10.5281/zenodo.5949075

arXiv:2112.02125 [pdf, other]

Examining Zero-Shot Vulnerability Repair with Large Language Models

Authors: Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, Brendan Dolan-Gavitt

Abstract: Human developers can produce code with cybersecurity bugs. Can emerging 'smart' code completion tools help repair those bugs? In this work, we examine the use of large language models (LLMs) for code (such as OpenAI's Codex and AI21's Jurassic J-1) for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure cod… ▽ More Human developers can produce code with cybersecurity bugs. Can emerging 'smart' code completion tools help repair those bugs? In this work, we examine the use of large language models (LLMs) for code (such as OpenAI's Codex and AI21's Jurassic J-1) for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure code. This is difficult due to the numerous ways to phrase key information - both semantically and syntactically - with natural languages. We perform a large scale study of five commercially available, black-box, "off-the-shelf" LLMs, as well as an open-source model and our own locally-trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios. Our experiments demonstrate that while the approach has promise (the LLMs could collectively repair 100% of our synthetically generated and hand-crafted scenarios), a qualitative evaluation of the model's performance over a corpus of historical real-world examples highlights challenges in generating functionally correct code. △ Less

Submitted 15 August, 2022; v1 submitted 3 December, 2021; originally announced December 2021.

Comments: 18 pages, 19 figures. Accepted for publication in 2023 IEEE Symposium on Security and Privacy (SP)

arXiv:2111.12746 [pdf, other]

doi 10.1109/LES.2021.3129108

Needle in a Haystack: Detecting Subtle Malicious Edits to Additive Manufacturing G-code Files

Authors: Caleb Beckwith, Harsh Sankar Naicker, Svara Mehta, Viba R. Udupa, Nghia Tri Nim, Varun Gadre, Hammond Pearce, Gary Mac, Nikhil Gupta

Abstract: Increasing usage of Digital Manufacturing (DM) in safety-critical domains is increasing attention on the cybersecurity of the manufacturing process, as malicious third parties might aim to introduce defects in digital designs. In general, the DM process involves creating a digital object (as CAD files) before using a slicer program to convert the models into printing instructions (e.g. g-code) sui… ▽ More Increasing usage of Digital Manufacturing (DM) in safety-critical domains is increasing attention on the cybersecurity of the manufacturing process, as malicious third parties might aim to introduce defects in digital designs. In general, the DM process involves creating a digital object (as CAD files) before using a slicer program to convert the models into printing instructions (e.g. g-code) suitable for the target printer. As the g-code is an intermediate machine format, malicious edits may be difficult to detect, especially when the golden (original) models are not available to the manufacturer. In this work we aim to quantify this hypothesis through a red-team/blue-team case study, whereby the red-team aims to introduce subtle defects that would impact the properties (strengths) of the 3D printed parts, and the blue-team aims to detect these modifications in the absence of the golden models. The case study had two sets of models, the first with 180 designs (with 2 compromised using 2 methods) and the second with 4320 designs (with 60 compromised using 6 methods). Using statistical modelling and machine learning (ML), the blue-team was able to detect all the compromises in the first set of data, and 50 of the compromises in the second. △ Less

Submitted 24 November, 2021; originally announced November 2021.

Comments: To appear in IEEE Embedded Systems Letters

arXiv:2110.01974 [pdf, other]

Runtime Interchange for Adaptive Re-use of Intelligent Cyber-Physical System Controllers

Authors: Hammond Pearce, Xin Yang, Srinivas Pinisetty, Partha S. Roop

Abstract: Cyber-Physical Systems (CPSs) such as those found within autonomous vehicles are increasingly adopting Artificial Neural Network (ANN)-based controllers. To ensure the safety of these controllers, there is a spate of recent activity to formally verify the ANN-based designs. There are two challenges with these approaches: (1) The verification of such systems is difficult and time consuming. (2) The… ▽ More Cyber-Physical Systems (CPSs) such as those found within autonomous vehicles are increasingly adopting Artificial Neural Network (ANN)-based controllers. To ensure the safety of these controllers, there is a spate of recent activity to formally verify the ANN-based designs. There are two challenges with these approaches: (1) The verification of such systems is difficult and time consuming. (2) These verified controllers are not able to adapt to frequent requirements changes, which are typical in situations like autonomous driving. This raises the question: how can trained and verified controllers, which have gone through expensive training and verification processes, be re-used to deal with requirement changes? This paper addresses this challenge for the first time by proposing a new framework that is capable of dealing with requirement changes at runtime through a mechanism we term runtime interchange. Our approach functions via a continual exchange and selection process of multiple pre-verified controllers. It represents a key step on the way to component-oriented engineering for intelligent designs, as it preserves the behaviours of the original controllers while introducing additional functionality. To demonstrate the efficacy of our approach we utilise an existing autonomous driving case study as well as a set of smaller benchmarks. These show that introduced overheads are extremely minimal and that the approach is very scalable. △ Less

Submitted 23 September, 2021; originally announced October 2021.

Comments: 10 pages, 7 figures

arXiv:2108.09293 [pdf, other]

Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions

Authors: Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri

Abstract: There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described `AI pair programmer', GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs - and so, given the vast quantity… ▽ More There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described `AI pair programmer', GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs - and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns on the security of Copilot's code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g. those from MITRE's "Top 25" list). We explore Copilot's performance on three distinct code generation axes -- examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, producing 1,689 programs. Of these, we found approximately 40% to be vulnerable. △ Less

Submitted 16 December, 2021; v1 submitted 20 August, 2021; originally announced August 2021.

Comments: Accepted for publication in IEEE Symposium on Security and Privacy 2022

arXiv:2104.09562 [pdf, other]

FLAW3D: A Trojan-based Cyber Attack on the Physical Outcomes of Additive Manufacturing

Authors: Hammond Pearce, Kaushik Yanamandra, Nikhil Gupta, Ramesh Karri

Abstract: Additive Manufacturing (AM) systems such as 3D printers use inexpensive microcontrollers that rarely feature cybersecurity defenses. This is a risk, especially given the rising threat landscape within the larger digital manufacturing domain. In this work we demonstrate this risk by presenting the design and study of a malicious Trojan (the FLAW3D bootloader) for AVR-based Marlin-compatible 3D prin… ▽ More Additive Manufacturing (AM) systems such as 3D printers use inexpensive microcontrollers that rarely feature cybersecurity defenses. This is a risk, especially given the rising threat landscape within the larger digital manufacturing domain. In this work we demonstrate this risk by presenting the design and study of a malicious Trojan (the FLAW3D bootloader) for AVR-based Marlin-compatible 3D printers (>100 commercial models). We show that the Trojan can hide from programming tools, and even within tight design constraints (less than 1.7 kilobytes in size), it can compromise the quality of additively manufactured prints and reduce tensile strengths by up to 50%. △ Less

Submitted 19 April, 2021; originally announced April 2021.

Comments: 8 pages, 11 figures

arXiv:2009.01026 [pdf, other]

doi 10.1145/3380446.3430634

DAVE: Deriving Automatically Verilog from English

Authors: Hammond Pearce, Benjamin Tan, Ramesh Karri

Abstract: While specifications for digital systems are provided in natural language, engineers undertake significant efforts to translate them into the programming languages understood by compilers for digital systems. Automating this process allows designers to work with the language in which they are most comfortable --the original natural language -- and focus instead on other downstream design challenge… ▽ More While specifications for digital systems are provided in natural language, engineers undertake significant efforts to translate them into the programming languages understood by compilers for digital systems. Automating this process allows designers to work with the language in which they are most comfortable --the original natural language -- and focus instead on other downstream design challenges. We explore the use of state-of-the-art machine learning (ML) to automatically derive Verilog snippets from English via fine-tuning GPT-2, a natural language ML system. We describe our approach for producing a suitable dataset of novice-level digital design tasks and provide a detailed exploration of GPT-2, finding encouraging translation performance across our task sets (94.8% correct), with the ability to handle both simple and abstract design tasks. △ Less

Submitted 27 August, 2020; originally announced September 2020.

Comments: 6 pages, 2 figures

arXiv:2008.11830 [pdf, other]

doi 10.1109/LES.2020.3009910

Designing Neural Networks for Real-Time Systems

Authors: Hammond Pearce, Xin Yang, Partha S. Roop, Marc Katzef, Tórur Biskopstø Strøm

Abstract: Artificial Neural Networks (ANNs) are increasingly being used within safety-critical Cyber-Physical Systems (CPSs). They are often co-located with traditional embedded software, and may perform advisory or control-based roles. It is important to validate both the timing and functional correctness of these systems. However, most approaches in the literature consider guaranteeing only the functional… ▽ More Artificial Neural Networks (ANNs) are increasingly being used within safety-critical Cyber-Physical Systems (CPSs). They are often co-located with traditional embedded software, and may perform advisory or control-based roles. It is important to validate both the timing and functional correctness of these systems. However, most approaches in the literature consider guaranteeing only the functionality of ANN based controllers. This issue stems largely from the implementation strategies used within common neural network frameworks -- their underlying source code is often simply unsuitable for formal techniques such as static timing analysis. As a result, developers of safety-critical CPS must rely on informal techniques such as measurement based approaches to prove correctness, techniques that provide weak guarantees at best. In this work we address this challenge. We propose a design pipeline whereby neural networks trained using the popular deep learning framework Keras are compiled to functionally equivalent C code. This C code is restricted to simple constructs that may be analysed by existing static timing analysis tools. As a result, if compiled to a suitable time-predictable platform all execution bounds may be statically derived. To demonstrate the benefits of our approach we execute an ANN trained to drive an autonomous vehicle around a race track. We compile the ANN to the Patmos time-predictable controller, and show that we can derive worst case execution timings. △ Less

Submitted 26 August, 2020; originally announced August 2020.

Comments: 4 pages, 2 figures. IEEE Embedded Systems Letters, 2020

Showing 1–27 of 27 results for author: Pearce, H