License: CC BY-NC-SA 4.0
arXiv:2403.11671v1 [cs.AR] 18 Mar 2024

HDLdebugger: Streamlining HDL debugging with Large Language Models

Xufeng Yao*, CUHK & Huawei, Hong Kong SAR, China
Haoyang Li*, Huawei, Hong Kong SAR, China
Tsz Ho Chan*, Huawei, Hong Kong SAR, China
Wenyi Xiao, Huawei, Hong Kong SAR, China
Mingxuan Yuan, Huawei, Hong Kong SAR, China
Yu Huang, HiSilicon, Shenzhen, China
Lei Chen†, HKUST & HKUST (GZ), Hong Kong SAR, China
Bei Yu†, CUHK, Hong Kong SAR, China
(2024)
Abstract.

In the domain of chip design, Hardware Description Languages (HDLs) play a pivotal role. However, due to the complex syntax of HDLs and the limited availability of online resources, debugging HDL codes remains a difficult and time-intensive task, even for seasoned engineers. Consequently, there is a pressing need to develop automated HDL code debugging models, which can alleviate the burden on hardware engineers. Despite the strong capabilities of Large Language Models (LLMs) in generating, completing, and debugging software code, their utilization in the specialized field of HDL debugging has been limited and, to date, has not yielded satisfactory results. In this paper, we propose an LLM-assisted HDL debugging framework, namely HDLdebugger, which consists of HDL debugging data generation via a reverse engineering approach, a search engine for retrieval-augmented generation, and a retrieval-augmented LLM fine-tuning approach. Through the integration of these components, HDLdebugger can automate and streamline HDL debugging for chip design. Our comprehensive experiments, conducted on an HDL code dataset sourced from Huawei, reveal that HDLdebugger outperforms 13 cutting-edge LLM baselines, displaying exceptional effectiveness in HDL code debugging.

Code Debugging, Large Language Model, Retrieval Augmented Generation
* Equal contribution. † Corresponding author.
This work was completed during Xufeng Yao's internship at Huawei.
Conference: August 25–29, 2024, Barcelona, Spain
CCS Concepts: Theory of computation → Program semantics; Computing methodologies → Natural language processing; Hardware → Hardware description languages and compilation

1. Introduction

Hardware Description Languages (HDLs) are crucial in the realm of chip design, serving as the cornerstone for creating, testing, and implementing digital systems (Gordon, 1995; Cong et al., 2011; Zhang et al., 2015). Despite their critical role, the domain of HDL debugging has received comparatively scant attention. Traditional debugging approaches primarily involve manual code correction based on syntactic guidelines, followed by iterative testing through compilers. This process, while straightforward for languages such as Python and Java, becomes markedly more complex for HDLs due to their sophisticated syntax and the scarcity of accessible resources online. Furthermore, compiler-based testing of HDL code, especially in the context of chip development, is exceptionally time-consuming and resource-intensive.

Despite the high demand in the industry for effective HDL debugging techniques and the promising directions they offer, existing methodologies often fall short in addressing the complexities of the problem. For example, the template-based method (Jiang et al., 2018; Liu et al., 2019; Monperrus, 2018; Huang et al., 2023), a traditional strategy in code debugging, utilizes expert-defined code patterns or heuristics to identify and correct errors. However, this approach is inherently limited, capable of rectifying only those errors with predefined patterns. Consequently, it lacks the flexibility and adaptability necessary to tackle a diverse array of bugs.

Table 1. Pilot Debugging Experiments
Method | GPT4 (Achiam et al., 2023) | RTLFixer (Tsai et al., 2023) | VeriGen (Thakur et al., 2023)
Pass-rate@1 | 6.35% | 28.35% | 1.34%

Recently, researchers have explored the direct application of large language models (LLMs) to rectify buggy code. The underlying hypothesis is that LLMs, pre-trained on extensive repositories of open-source code and text, such as Python code, can effectively discern bug patterns and automatically repair buggy code. To assess the efficacy of current LLM-based approaches in addressing industry-level HDL debugging challenges, we conduct a pilot experiment on three typical methods, as shown in Table 1. Among the methods evaluated, GPT4 (Achiam et al., 2023) is the current state-of-the-art LLM; RTLFixer (Tsai et al., 2023) leverages retrieval-augmented generation (RAG) (Gao et al., 2023) and advanced prompt engineering (Yao et al., 2022) specifically tailored for HDL debugging tasks; and VeriGen (Thakur et al., 2023) is a hardware LLM trained on a self-contained hardware dataset. Despite these advanced approaches, none of the methods delivered satisfactory results in industry-level HDL debugging scenarios: the best pass-rate@1 in Table 1 is 28.35%. A primary contributing factor to this shortfall is the insufficiency of HDL code resources for training. Consequently, these pre-trained LLMs struggle to accurately comprehend the syntax and functionality inherent to HDL codes.

To tackle the problem, we propose an HDL debugging framework, namely HDLdebugger, which consists of three components, i.e., data generation, search engine, and retrieval-augmented LLM fine-tuning. Firstly, the data generation procedure targets overcoming the obstacle of the limited availability of HDL bugs. Specifically, we employ reverse engineering to insert specific modifications into the original error-free code. Therefore, we can produce corresponding buggy versions and error messages via compilers, which are used to construct a code database and further fine-tune LLMs. Secondly, we propose an effective and efficient search engine, which is supported by the code database constructed by the data generation approach and the document database with various internal HDL documents which contain relevant information for buggy codes. Given a buggy code and its error message, the search engine retrieves relevant information (i.e., document RAG) and buggy codes (i.e., code RAG) with similar patterns from the document database and code database, respectively. The document RAG and code RAG are crucial for both the fine-tuning and inference stages of our retrieval-augmented LLM, enhancing the ability of LLMs to comprehensively understand the HDL buggy code and repair it effectively. Thirdly, to enhance the ability of LLMs to generate accurate code solutions, we propose a novel fine-tuning approach for LLMs. This approach incorporates a self-guided thought generation mechanism and a retrieval-augmented fine-tuning process, significantly improving the LLM’s performance in debugging HDL code.

Our contributions are summarized as follows:

  • We introduce an advanced LLM-based HDL debugging framework supporting chip designs in the industry, namely HDLdebugger, which consists of buggy data generation, search engine, and retrieval-augmented LLM fine-tuning.

  • To address the scarcity of high-quality HDL debugging training data, we propose a data generation approach based on reverse engineering to comprehensively generate diverse and realistic HDL buggy codes paired with their correct versions.

  • We propose a search engine to create code RAG (resp. doc RAG) for HDL buggy code (resp. relevant information) effectively and efficiently, enhancing the fine-tuning and inference of LLMs.

  • We present a novel retrieval-augmented fine-tuning approach for HDL debugging, which integrates self-guided thought generation with RAG-based fine-tuning strategies.

  • Extensive experiments on the HDL code dataset from Huawei demonstrate superior performance against 13 state-of-the-art baselines, including GPT4 and various HDL debugging LLMs.

2. Methodology

This section provides a comprehensive overview of the proposed HDLdebugger. Initially, we delve into the buggy code generation pipeline, as detailed in Section 2.1. Subsequently, the search engine mechanism tailored for Retrieval-Augmented Generation (RAG) is presented in Section 2.2. The Retrieval-Augmented LLM fine-tuning is elaborated upon in Section 2.3. The important notations in our paper are shown in Table 2.

Table 2. Important notations.
Notation | Description
$b_i$ | Buggy code
$e_j$ | Error id
$m_i$ | Error messages for $b_i$
$c_i$ | Correct code for $b_i$
$(d_j, r_j, s_j)$ | Descriptions $d_j$, reasons $r_j$, and potential solutions $s_j$ for $e_j$
$D_e, D_c$ | Error database, code database
$\mathbf{z}^{w}_{i}$ | Keyword vector for buggy code $b_i$ with error message $m_i$
$\mathrm{sim}(I_i, I_j)$ | Similarity between buggy codes $I_i$ and $I_j$
$q=(b,m)$ | Code query consisting of buggy code $b$ and error message $m$
$rag^{d}_{i}$ | The document RAG based on error message $m_i$ and $b_i$
$rag^{c}_{i}$ | The code RAG based on buggy code $b_i$
$p_t$ | Prompt of thought generation
$p_c$ | Prompt of buggy code correction
$t_i$ | The thought for solving buggy code $b_i$

Framework. As shown in Fig. 1, given a buggy code and its associated error message, our proposed HDLdebugger aims to repair this buggy code into the correct one. Specifically, we first propose a data generation approach in Sec. 2.1 to generate a set of HDL code instances, where each instance consists of a buggy code, its error messages, and its correct version. These generated HDL code instances are used to provide context for buggy code queries and to fine-tune the LLMs. Second, we propose a search engine in Sec. 2.2, which retrieves relevant text information (i.e., document RAG) for error messages and retrieves buggy codes with similar buggy patterns (i.e., code RAG). Then, HDLdebugger passes the buggy code, its error message, the task prompt, the document RAG, and the code RAG to the LLMs, enabling them to predict the correct code. Specifically, we introduce a retrieval-augmented fine-tuning approach to fine-tune the LLMs for HDL code debugging in Sec. 2.3.

Figure 1. Framework overview of HDLdebugger.
Figure 2. Reverse engineering pipeline for introducing errors into correct code using diverse modification functions summarized from HDL documents.

2.1. HDL Buggy Data Generation

Unlike software languages such as Python or C++, where code can often be crawled from open websites and platforms like GitHub, HDL codes, particularly those used for chip testing, are rarely made public due to privacy and commercial concerns. This limitation presents a significant obstacle to fine-tuning LLMs in the domain of HDL. In this section, we introduce a reverse engineering pipeline for generating high-quality HDL code pairs that consist of both buggy and corrected versions. As shown in Fig. 2, the HDL data generation consists of two steps, i.e., modification function generation and sample generation.

2.1.1. Modification Function Generation

The first step aims to generate a set of high-quality modification functions. These functions are then employed to modify HDL codes provided by industry engineers, thereby generating a diverse collection of buggy code examples. To ensure that the modified codes exhibit a range of realistic and diverse error patterns, we construct the modification functions by leveraging the capabilities of LLMs and a comprehensive collection of industrial HDL documents. Specifically, as shown in Fig. 2, we first collect a set of HDL documents, including the HDL manual, expert notes, and user logs. We then carefully design a prompt that guides the LLMs to extract and summarize prevalent and critical error patterns in HDL, such as syntax misuse or logical errors. With these insights, we proceed to develop the modification functions. The distilled functions focus on simple operations such as adding, deleting, modifying, and adjusting segments within HDL scripts, while the unique set of rules governing HDL ensures that similar operations can result in vastly different errors recorded in HDL documents. These functions explicitly introduce errors into the original correct HDL codes that mirror the errors commonly encountered in industry, thereby creating an invaluable dataset to fine-tune LLMs.
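To make the idea of a modification function concrete, the following is a minimal sketch, assuming Verilog-style source text. The paper does not name its HDL, and its functions are distilled by an LLM from internal documents, so the function names and the specific edits below are hypothetical illustrations of the add, delete, and modify operations described above.

```python
import re
from typing import Callable, Dict

# Hypothetical modification functions (not the paper's): each takes correct
# HDL source text and returns a variant with one injected error.

def delete_first_semicolon(code: str) -> str:
    """Delete: drop the first statement terminator."""
    return code.replace(";", "", 1)

def rename_first_use(code: str, name: str = "clk") -> str:
    """Modify: misspell the first occurrence of an identifier."""
    return re.sub(rf"\b{re.escape(name)}\b", name + "_x", code, count=1)

def add_stray_end(code: str) -> str:
    """Add: insert an unmatched `end` before `endmodule`."""
    return code.replace("endmodule", "end\nendmodule", 1)

MODIFICATIONS: Dict[str, Callable[[str], str]] = {
    "delete_semicolon": delete_first_semicolon,
    "rename_identifier": rename_first_use,
    "add_stray_end": add_stray_end,
}

# Assumed seed code; any correct HDL snippet would do.
SEED = """module counter(input clk, output reg [3:0] q);
  always @(posedge clk) q <= q + 1;
endmodule
"""

# One seed yields one buggy variant per modification function.
buggy_variants = {name: fn(SEED) for name, fn in MODIFICATIONS.items()}
```

Because every variant is derived from a known-good seed by a known edit, the correct version of each buggy code is available without manual repair.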

2.1.2. Sample Generation

In this step, we gather a broad range of accurate and high-quality HDL codes $C=\{c_i\}_{i=1}^{|C|}$ from experienced chip engineers. These codes, which have been used across various chip designs, are comprehensive enough to cover a wide range of functional testing scenarios for chips, and they serve as the seed code for error case construction. Specifically, by applying the modification functions from Sec. 2.1.1 to these correct HDL codes, we systematically introduce errors, thereby producing a set of buggy codes. In particular, we can apply various modification functions to one HDL code, which allows us to generate multiple instances of buggy code. Next, we employ an HDL compiler to compile these intentionally buggy codes on their respective designs, which inevitably results in compilation errors. Then, for each correct HDL code $c_i \in C$, we collect one of its buggy versions $b_i$ with the associated error message $m_i$ as an instance $I_i=(b_i,m_i,c_i)$ of training data. Normally, diagnosing and rectifying HDL errors requires the expertise of chip engineers. Because our modifications introduce the errors systematically, we can employ reverse engineering to identify the faults and produce solutions directly. This bypasses the need for manual error diagnosis, streamlining the creation of comprehensive datasets of error scripts, error messages, and corresponding solutions. This method ensures a rich diversity in the types of errors produced, which is critical for creating an extensive and effective training dataset $D_c=\{I_j\}_{j=1}^{|D_c|}$.
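The sample generation loop can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `build_instances`, `toy_compile`, and the error ids `E001`/`E002` are hypothetical, and the toy checker merely stands in for a real HDL compiler, which the paper does not name.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional

class Instance(NamedTuple):
    buggy: str    # b_i
    message: str  # m_i
    correct: str  # c_i

def build_instances(
    seeds: Iterable[str],
    modifications: Iterable[Callable[[str], str]],
    compile_fn: Callable[[str], Optional[str]],
) -> List[Instance]:
    """Apply each modification to each seed and keep variants that fail to compile."""
    modifications = list(modifications)
    dataset: List[Instance] = []
    for correct in seeds:
        for modify in modifications:
            buggy = modify(correct)
            message = compile_fn(buggy)  # None means "compiled cleanly"
            if buggy != correct and message:
                # The seed itself is the known fix, so no manual repair is needed.
                dataset.append(Instance(buggy, message, correct))
    return dataset

def toy_compile(code: str) -> Optional[str]:
    """Hypothetical stand-in for an HDL compiler; returns an error message or None."""
    if code.count("(") != code.count(")"):
        return "E001: unbalanced parentheses"
    if not code.rstrip().endswith("endmodule"):
        return "E002: missing endmodule"
    return None

seeds = ["module a(input x);\nendmodule", "module b(output y);\nendmodule"]
mods = [
    lambda c: c.replace(")", "", 1),       # delete a closing parenthesis
    lambda c: c.replace("endmodule", ""),  # delete the module terminator
    lambda c: c,                           # no-op, filtered out below
]
D_c = build_instances(seeds, mods, toy_compile)
```

Filtering on a non-empty compiler message keeps only variants that actually produce an error, so every stored triple has a usable $m_i$.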

2.2. Search Engine for RAG

In this subsection, we propose a search engine for retrieval-augmented generation (RAG) that retrieves relevant information from the HDL documents and code instances. The retrieved content serves as contextual information for queries, thereby enhancing the capability of LLMs to understand the context of a bug and identify issues within buggy codes. We illustrate the overview of the search engine framework in Fig. 3.

Figure 3. The search engine to retrieve relevant information given error messages and retrieve similar buggy codes based on a buggy code query.

2.2.1. Document RAG

As shown in Fig. 3 (a), we first gather a comprehensive collection of instructional documents for this HDL, encompassing language specifications, error diagnostics, and troubleshooting techniques. Then, we meticulously curate the content of these documents, distilling a dedicated error database $D_e$ tailored to this HDL. This error database contains a detailed error description $d_j$, the underlying reasons $r_j$, and suggested remedial strategies $s_j$ for each error id $e_j$. We illustrate examples of error information in the error database in Tab. 7 in the Appendix. Given an error message query $m_i$, we first parse $m_i$ to extract its $n_{e,i}$ constituent error ids $\{e_j\}_{j=1}^{n_{e,i}}$. Subsequently, we retrieve the descriptions, reasons, and potential solutions for each identified error id from our error database. Then, the document RAG of the query $m_i$ can be assembled as $rag^{d}_{i}=\{(e_j,d_j,r_j,s_j)\}_{j=1}^{n_{e,i}}$, thereby helping LLMs to understand the error message $m_i$.
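The lookup described above amounts to parsing error ids out of a compiler message and joining them against the error database. A minimal sketch follows, assuming ids of the form `E` plus four digits; the id format, the database entries, and the function name `document_rag` are all hypothetical, since the paper's error database is internal.

```python
import re
from typing import Dict, List, NamedTuple, Tuple

class ErrorEntry(NamedTuple):
    description: str  # d_j
    reason: str       # r_j
    solution: str     # s_j

# Hypothetical error database D_e; ids and texts are invented for illustration.
ERROR_DB: Dict[str, ErrorEntry] = {
    "E1001": ErrorEntry(
        "Undeclared identifier",
        "A signal is used before it is declared",
        "Declare the signal or fix its spelling",
    ),
    "E2002": ErrorEntry(
        "Syntax error",
        "A statement terminator is missing",
        "Add the missing ';'",
    ),
}

ERROR_ID = re.compile(r"\bE\d{4}\b")  # assumed id format

def document_rag(message: str, db: Dict[str, ErrorEntry]) -> List[Tuple[str, str, str, str]]:
    """Return (e_j, d_j, r_j, s_j) for each distinct known error id in the message."""
    seen = set()
    rag = []
    for eid in ERROR_ID.findall(message):
        if eid in db and eid not in seen:
            seen.add(eid)
            rag.append((eid, *db[eid]))
    return rag

msg = (
    "line 3: E2002 unexpected token\n"
    "line 5: E1001 'clk' not found\n"
    "line 9: E2002 unexpected token"
)
rag_d = document_rag(msg, ERROR_DB)
```

Deduplicating ids keeps the retrieved context short when the same error is reported on several lines, which is a design choice of this sketch rather than something the paper specifies.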

2.2.2. Buggy Code RAG

As shown in Fig. 3 (c), in the code retrieval component, we maintain a code database $D_c=\{I_i\}_{i=1}^{|D_c|}$, where each code instance $I_i=(b_i,m_i,c_i)$ consists of a buggy code $b_i$, its associated error messages $m_i$, and the correct code $c_i$. Given a query $q=(b,m)$ that includes a snippet of buggy code $b$ with its associated error message $m$, the aim of the code RAG is to identify a subset $D^{q}_{c}\subseteq D_c$ of the top-$k$ code instances that are most similar to the buggy code $b$ in the query. In this way, the buggy code $b$ shares similar buggy patterns with the buggy code in each instance $I_i=(b_i,m_i,c_i)\in D^{q}_{c}$. Thus, the correct code $c_i$ for each buggy code $b_i$ can be provided to LLMs. As a result, LLMs can use the patterns learned from the top-$k$ similar instances to fix the buggy code $b$ in the query. Specifically, we first introduce how to learn low-dimensional vectors for buggy codes and then introduce a two-stage ranker that retrieves the top-$k$ buggy code instances for the code query $q$.

Vector Database Construction. Given the complexity and length of HDL code and its associated error messages, and given that similar-looking code snippets can contain distinct buggy patterns, computing the similarity between buggy codes is challenging. To address this challenge, we propose to measure the similarity between buggy codes from two aspects, i.e., keyword similarity and semantic similarity. First, we extract words from all buggy codes and their error messages and use the TF-IDF (Aizawa, 2003) technique to compute the weight of each word. Then, we can generate the keyword vector $\mathbf{z}^{w}_{i}$ for each buggy code $b_i$ with its error message $m_i$. Second, we design a BERT-LSTM model that combines BERT, for its powerful language understanding capabilities, with an LSTM, for its sequential data processing strengths. The BERT-LSTM model encodes a buggy code $b_i$ with its error message $m_i$ into a low-dimensional embedding $\mathbf{z}^{s}_{i}$. Specifically, as introduced in Sec. 2.1, the buggy codes are generated by different modification functions; we take these modification functions as the labels of the buggy codes and optimize the BERT-LSTM model with these labels. Then, we use the final representation of BERT-LSTM as the semantic embedding for each buggy code and its error messages. The architecture details of the BERT-LSTM model are introduced in Appx. B.1. Based on the TF-IDF and BERT-LSTM models, we build a keyword vector database and a semantic vector database, which together facilitate a robust framework for analyzing the similarity between instances of buggy HDL code.
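The keyword side of the vector database can be illustrated with a small hand-rolled TF-IDF. This is only a sketch: the tokenizer, the smoothed IDF variant, and the choice to concatenate each buggy code with its error message into one string are assumptions, and the BERT-LSTM semantic encoder is not reproduced here.

```python
import math
import re
from collections import Counter
from typing import Dict, List

TOKEN = re.compile(r"[A-Za-z_]\w*")  # assumed tokenizer: identifiers and words

def tokenize(text: str) -> List[str]:
    return TOKEN.findall(text.lower())

def fit_idf(corpus: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency over the corpus (one common variant)."""
    n = len(corpus)
    df: Counter = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def keyword_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse keyword vector z^w: term count times IDF, restricted to known words."""
    tf = Counter(w for w in tokenize(text) if w in idf)
    return {w: count * idf[w] for w, count in tf.items()}

def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Each entry stands for "buggy code + error message" of one instance (toy data).
corpus = [
    "assign y = a & b E2002 missing semicolon",
    "always @(posedge clk) q <= d E1001 undeclared clk",
    "assign z = a | b E2002 missing semicolon",
]
idf = fit_idf(corpus)
vectors = [keyword_vector(doc, idf) for doc in corpus]
```

In this toy corpus the first and third entries share an error id and most tokens, so their keyword vectors are closer to each other than either is to the second entry.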

Two-stage Ranker. We design a two-stage ranking approach to identify the top-$k$ buggy code instances most relevant to a given buggy code query. In the first ranking stage, for any given buggy code query $q=(b,m)$, we first use the TF-IDF encoder and the BERT-LSTM encoder to generate its keyword vector $\mathbf{z}^{w}$ and semantic vector $\mathbf{z}^{s}$. Then, we define the similarity between the query $q$ and each instance $I_i\in D_c$ in the code database $D_c$ as:

(1) $$\mathrm{sim}(q,I_i)=\lambda\cdot \mathrm{cosine}(\mathbf{z}^{w},\mathbf{z}^{w}_{i})+(1-\lambda)\cdot \mathrm{cosine}(\mathbf{z}^{s},\mathbf{z}^{s}_{i})+1,$$

where $\lambda\in[0,1]$ is a hyper-parameter that balances keyword similarity against semantic similarity. Since each cosine term lies in $[-1,1]$ and the two weights sum to 1, the added constant shifts the score so that $\mathrm{sim}(q,I_i)\in[0,2]$. We select the top-$N$ similar instances $\hat{D}^{q}_{c}$ from the code database $D_c$ based on the similarity score. In the second ranking stage, our goal is to pinpoint the top-$k$ relevant yet diverse buggy codes. These selections are aimed at providing a broader range of buggy patterns, which can help the LLM repair the bugs in the query. Formally, given the $N$ buggy instances $\hat{D}^{q}_{c}$ and the buggy code query $q=(b,m)$, we select the top-$k$ relevant and diverse instances $D^{q}_{c}$ by maximizing the following objective:

(2) $$\max_{D^{q}_{c}\subseteq\hat{D}^{q}_{c}}\ \sum_{I_i\in D^{q}_{c}}\mathrm{sim}(q,I_i)+\frac{1}{k}\cdot\sum_{I_i\in D^{q}_{c}}\mathrm{dis}(I_i,D^{q}_{c}),$$

where the distance $dis(I_{i},D^{q}_{c})=\min_{I_{j}\in D^{q}_{c}\setminus I_{i}}(2-sim(I_{i},I_{j}))$ denotes the diversity value between each instance $I_{i}$ and the other instances in $D^{q}_{c}$, with $dis(I_{i},D^{q}_{c})\in[0,2]$. Solving Eq. (2) is NP-hard, which can be shown by a reduction from the well-known $k$-clique problem (Tsourakakis, 2015) by setting $sim(q,I_{i})=1$ for all $I_{i}\in D_{c}$.
Therefore, we propose a greedy algorithm with an approximation ratio of $1-1/e$ to identify the top-$k$ relevant buggy codes. The details of the greedy algorithm and its approximation ratio are given in Appx. B.2. Thus, given a buggy code $b_{i}$ and error message $m_{i}$, we can generate the code retrieval-augmented generation (code RAG) $rag^{c}_{i}=\{(b_{j},m_{j},c_{j})\}_{j=1}^{k}$.
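To make the objective in Eq. (2) concrete, the following sketch greedily adds, at each step, the candidate that yields the largest objective value. It is only an illustration under stated assumptions: the actual greedy algorithm and its analysis are in Appx. B.2 and may differ, $sim$ is assumed here to be cosine similarity over hypothetical embedding vectors, the diversity term of a single-element set is taken as 0, and the names `cosine`, `objective`, and `greedy_select` are illustrative. No approximation guarantee is claimed for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed stand-in for sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def objective(query, chosen, k):
    """Value of Eq. (2) for a selected set `chosen` (a list of vectors)."""
    relevance = sum(cosine(query, x) for x in chosen)
    diversity = 0.0
    if len(chosen) > 1:  # dis is undefined for a singleton; assumed 0 here
        for i, x in enumerate(chosen):
            diversity += min(
                2.0 - cosine(x, y) for j, y in enumerate(chosen) if j != i
            )
    return relevance + diversity / k


def greedy_select(query, candidates, k):
    """Return indices of up to k candidates chosen greedily w.r.t. Eq. (2)."""
    chosen_idx = []
    remaining = list(range(len(candidates)))
    while remaining and len(chosen_idx) < k:
        best = max(
            remaining,
            key=lambda i: objective(
                query, [candidates[j] for j in chosen_idx + [i]], k
            ),
        )
        chosen_idx.append(best)
        remaining.remove(best)
    return chosen_idx
```

With a query `[1.0, 0.0]` and candidates `[[1.0, 0.1], [1.0, 0.1], [1.0, -0.3]]`, the exact duplicate of the first pick is skipped in favour of the more diverse third candidate, which is the behaviour the diversity term is meant to encourage.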

2.3. Retrieval-augmented LLM Fine-tuning

In this subsection, we describe how to fine-tune the LLMs based on the training dataset constructed in Sec. 2.1 and the search engine proposed in Sec. 2.2. Specifically, we first generate a thought that helps the LLM repair each buggy code in the training dataset. Second, based on the generated thought and the retrieved buggy codes, we propose a retrieval-augmented supervised fine-tuning technique for LLMs.

2.3.1. Self-guided Thought Generation

One straightforward approach is to feed the buggy code and error messages to LLMs and let them directly predict the correct code. However, this approach is insufficient for LLMs to deeply comprehend the problem and provide accurate solutions. Recent research (Wei et al., 2022; Yao et al., 2023) suggests that when LLMs are prompted to produce a series of intermediate, explanatory thoughts before outputting the final solution to a given task, their performance can improve significantly. This is because these reasoning thoughts improve the LLMs' understanding of the input task and thus lead to more relevant and accurate outputs. Therefore, before fine-tuning the LLMs, we propose to generate a high-quality thought for each training code instance.

Specifically, as illustrated in Fig. 4, we first design a precise and explicit prompt $p_{t}$ to clarify the thought generation task for LLMs. Following this, we input the thought generation prompt $p_{t}$, the buggy code $b_{i}$, its associated error message $m_{i}$, the document RAG $rag_{i}^{d}$, and its correct version $c_{i}$ into the LLM in its inference mode, $LLM_{1}$. This enables the LLM to generate a thought $t_{i}$ on how to repair the buggy code $b_{i}$ into the correct code $c_{i}$ as follows:

$$t_{i}=LLM_{1}(p_{t}\circ b_{i}\circ m_{i}\circ c_{i}\circ rag_{i}^{d}), \tag{3}$$

where $\circ$ denotes the concatenation operator. We omit the general task requirements prompt for simplicity. Empirically, we find that when LLMs generate the thought $t_{i}$ only once, the output might be irrelevant to the buggy code $b_{i}$ or incorrect due to hallucination and the randomness of LLMs. Therefore, to improve the quality of the generated thought, we iteratively generate $L$ different thoughts $T_{i}=\{t_{i,j}\}_{j=1}^{L}$ for each buggy code $b_{i}$ by running $LLM_{1}$ $L$ times following Eq. (3). This is achieved by varying the temperature parameter; the details are described in the appendix.

To assess the quality of each thought $t_{i,j}\in T_{i}$, we adopt a self-guidance strategy to select the highest-quality thought from the thought set $T_{i}$ for the buggy code $b_{i}$. Specifically, we feed the buggy code $b_{i}$, its associated error message $m_{i}$, and the document RAG $rag_{i}^{d}$ to the LLM. A prompt $p_{c}$, "Based on the analysis, the correct script is," is appended to guide the LLM towards generating the predicted correct script, represented as:

$$\hat{c}_{i,j}=LLM_{1}(p_{c}\circ b_{i}\circ m_{i}\circ rag_{i}^{d}\circ t_{i,j}). \tag{4}$$

After we obtain the output $\hat{c}_{i,j}$ for each thought $t_{i,j}$, we employ the edit distance metric (Navarro, 2001) to evaluate the similarity between the predicted correct code $\hat{c}_{i,j}$ and the ground-truth corrected code $c_{i}$ as

$$d_{i,j}=EditDistance(c_{i},\hat{c}_{i,j}). \tag{5}$$

Intuitively, the smaller the distance $d_{i,j}$ between the predicted correct code $\hat{c}_{i,j}$ and the ground-truth corrected code $c_{i}$, the more helpful the thought $t_{i,j}$ is to the LLM in repairing the buggy code $b_{i}$. Therefore, we select the thought with the smallest edit distance for each $b_{i}$, i.e., $t_{i}=\arg\min_{t_{i,j}\in T_{i}}d_{i,j}$.
Finally, we obtain the training dataset $\mathcal{D}=\{(b_{i},m_{i},rag_{i}^{d},rag^{c}_{i},t_{i},c_{i})\}_{i=1}^{|D_{c}|}$. The details of the thought generation are illustrated in Alg. 2 in the Appendix.
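The selection procedure of Eqs. (3) to (5) can be sketched as follows. This is a minimal illustration and not Alg. 2 itself: `generate_thought` and `predict_code` are hypothetical callables standing in for the two uses of $LLM_{1}$, the sample index `j` stands in for whatever temperature variation is used, and `edit_distance` is a plain character-level Levenshtein distance, which is an assumption about the granularity of the metric.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def select_thought(buggy, message, doc_rag, correct,
                   generate_thought, predict_code, num_thoughts):
    """Return (thought, distance) with the smallest edit distance d_{i,j}."""
    best_thought, best_dist = None, None
    for j in range(num_thoughts):
        # Eq. (3): a thought conditioned on the known correct code.
        thought = generate_thought(buggy, message, correct, doc_rag, j)
        # Eq. (4): predicted code given the thought, without the correct code.
        predicted = predict_code(buggy, message, doc_rag, thought)
        # Eq. (5): distance to the ground-truth correct code.
        dist = edit_distance(correct, predicted)
        if best_dist is None or dist < best_dist:
            best_thought, best_dist = thought, dist
    return best_thought, best_dist
```

Ties are resolved in favour of the earliest sample, which is a choice made for this sketch only.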

2.3.2. Retrieval-augmented Fine-tuning

After obtaining the final training dataset $\mathcal{D}=\{(b_{i},m_{i},rag_{i}^{d},rag^{c}_{i},t_{i},c_{i})\}_{i=1}^{|D_{c}|}$, we perform supervised fine-tuning of the LLMs based on the buggy codes and the retrieval-augmented generation of Sec. 2.2.
Specifically, given each training instance $D_{i}=(b_{i},m_{i},rag_{i}^{d},rag^{c}_{i},t_{i},c_{i})\in\mathcal{D}$, we first feed the thought generation prompt $p_{t}$, the buggy code $b_{i}$, its error message $m_{i}$, the document RAG $rag_{i}^{d}$, and the code RAG $rag^{c}_{i}$ to a target LLM, $LLM_{2}$, to generate the predicted thought $\hat{t}_{i}$ as follows:

$$\hat{t}_{i}=LLM_{2}(p_{t}\circ p_{c}\circ b_{i}\circ m_{i}\circ rag_{i}^{d}\circ rag^{c}_{i}). \tag{6}$$

Then, based on the predicted thought $\hat{t}_{i}$ and the code correction prompt $p_{c}$, $LLM_{2}$ predicts the correct code $\hat{c}_{i}$ as follows:

$$\hat{c}_{i}=LLM_{2}(p_{t}\circ p_{c}\circ b_{i}\circ m_{i}\circ rag_{i}^{d}\circ rag^{c}_{i}\circ\hat{t}_{i}). \tag{7}$$
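The two-step prediction of Eqs. (6) and (7) amounts to two calls on the same model, where the second call sees the first call's output appended to the shared context. The sketch below illustrates this under assumptions: `llm` is a hypothetical callable mapping a prompt string to a completion string, the concatenation operator $\circ$ is rendered as joining with newlines, and both RAG inputs are assumed to be already serialized as text.

```python
def concat(*parts):
    """Assumed rendering of the concatenation operator as newline joining."""
    return "\n".join(parts)


def repair(llm, p_t, p_c, buggy, message, doc_rag, code_rag):
    """Predict a thought, then predict the corrected code given that thought."""
    # Shared context C = p_t, p_c, b_i, m_i, rag_i^d, rag_i^c.
    context = concat(p_t, p_c, buggy, message, doc_rag, code_rag)
    thought = llm(context)                 # Eq. (6)
    fixed = llm(concat(context, thought))  # Eq. (7)
    return thought, fixed
```

Any real deployment would also need to respect the model's context length, which this sketch does not address.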

Following (Wang et al., 2022; Taori et al., 2023), we fine-tune the target LLM in its training mode $LLM_{2}$ using the conventional next-token prediction objective and minimize the cross-entropy loss $\mathcal{L}_{\mathcal{D}}$:

$$\mathcal{L}_{t_{i}}=\sum_{w_{i,j}\in t_{i}}\log p_{LLM_{2}}(w_{i,j}\mid C\circ w_{i,<j}),$$
$$\mathcal{L}_{c_{i}}=\sum_{w_{i,k}\in c_{i}}\log p_{LLM_{2}}(w_{i,k}\mid C\circ\hat{t}_{i}\circ w_{i,<k}),$$
$$\mathcal{L}_{\mathcal{D}}=\frac{1}{2|\mathcal{D}|}\sum_{D_{i}\in\mathcal{D}}(\mathcal{L}_{t_{i}}+\mathcal{L}_{c_{i}}), \tag{8}$$

where the context $C=p_{t}\circ p_{c}\circ b_{i}\circ m_{i}\circ rag_{i}^{d}\circ rag^{c}_{i}$ denotes the concatenation of the inputs. For clarity, $w_{i,j}\in t_{i}$ denotes the $j$-th word in $t_{i}$, $w_{i,<j}$ denotes the words in $t_{i}$ before $w_{i,j}$, and $p_{LLM_{2}}(w_{i,j}\mid C\circ w_{i,<j})$ denotes the probability of $w_{i,j}$.
$\mathcal{L}_{t_{i}}$ and $\mathcal{L}_{c_{i}}$ denote the prediction cross-entropy loss on the thought ground truth $t_{i}$ and the correct code ground truth $c_{i}$ regarding the buggy code $b_{i}$, respectively.
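As a small numerical illustration of how the thought and code terms are combined in Eq. (8), the sketch below takes per-token log-probabilities as given and averages the two sequence losses over the dataset. The input format (a pair of log-probability lists per instance) and the function names are hypothetical. Eq. (8) is written as a sum of log-probabilities; the sketch negates that sum so that lower values are better, which is the usual cross-entropy convention and an assumption here.

```python
def sequence_nll(token_log_probs):
    """Negative log-likelihood of one sequence from its token log-probs."""
    return -sum(token_log_probs)


def dataset_loss(instances):
    """Average of thought and code losses over all instances, as in Eq. (8).

    `instances` is a list of (thought_log_probs, code_log_probs) pairs, where
    each element is a list of log p(w | context, previous tokens).
    """
    total = 0.0
    for thought_lp, code_lp in instances:
        total += sequence_nll(thought_lp) + sequence_nll(code_lp)
    return total / (2 * len(instances))
```

For example, `dataset_loss([([-1.0, -2.0], [-0.5]), ([-0.25], [-0.25])])` sums to 4.0 over two instances and is divided by $2|\mathcal{D}|=4$.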

Figure 4. Thought generation example.

3. Experiments

3.1. Experiment Setting

3.1.1. HDL Datasets in Huawei

We gather a diverse collection of HDL code files from Huawei. These files are curated so that each one targets a distinct chip design scenario, reflecting the varied and specialized demands of circuit design. Each HDL code file contains an extensive array of data, including variable assignments, detailed circuit designs, clocking information, function testing protocols, and various test items that are critical for the chip design process. Based on these HDL code files, we use the data generation approach in Sec. 2.1 to generate 92,143 distinct HDL training code instances. Each instance consists of the buggy HDL code, its error messages, and the correct HDL code. In the experiments, we split the data into training and testing sets at a ratio of 8:2.

3.1.2. Baselines

We compare our proposed HDLdebugger with 13 baselines in three types of code debugging approaches as follows:

  • Five large language models: We compare HDLdebugger against two state-of-the-art LLMs available through API services, ChatGPT (Brown et al., 2020) and GPT-4 (Achiam et al., 2023). Additionally, we compare HDLdebugger with three open-source LLMs: OpenChat (Wang et al., 2023a), Orca2 (Mitra et al., 2023), and Mistral (Jiang et al., 2023). These open-source models have shown performance on par with ChatGPT across various open LLM benchmarks.

  • Four Code Debugging and HDL Models: We compare HDLdebugger with two code debugging methods, Self-debug (Chen et al., 2023) and RTLfixer (Tsai et al., 2023), and two hardware code generation models, VeriGen (Thakur et al., 2023) and RTLCoder (Liu et al., 2023b). Self-debug is one of the most classical methods for code debugging. RTLfixer is proposed to solve HDL debugging problems. VeriGen and RTLCoder are two LLMs targeting hardware languages.

  • Four Code Language Models: We compare against four SOTA pretrained code LLMs, i.e., Deepseek (Bi et al., 2024), Starcoder (Li et al., 2023), Stablecode (Rombach et al., 2022), and WizardCoder (Luo et al., 2023).

For all baselines except ChatGPT (Brown et al., 2020) and GPT-4 (Achiam et al., 2023), we adopt three strategies, i.e., the raw model, the raw model with RAG, and the raw model with supervised fine-tuning (SFT).

  • Raw model: We only feed buggy code and error messages to the model and enable raw models to infer the correct code directly.

  • Raw Model with RAG: For buggy code, we feed the buggy code, its error message, and the document RAG and code RAG obtained in Sec. 2.2 to the raw model and enable the raw models to infer the correct code directly.

  • Raw Model with SFT: We take the buggy code and error message as inputs of LLMs and use the correct code as ground-truth to fine-tune the raw models.

Specifically, we selected CodeLlama-13b as our base model from the available code LLMs. The choice of 13b was driven by its optimal model size, which strikes a balance between training and deployment, taking into account both performance and cost factors.

3.1.3. Evaluation Metrics

For the overall debug system, we mainly evaluate its pass rate for correcting codes, relative code runtimes, and edit distance between correct code and buggy code. The calculation of these metrics is listed below.

  • Pass-Rate: The pass rate for executing the code files corrected by each method is defined as $P=\frac{\sum_{i=1}^{n_{c}}\mathbb{S}(y_{i})}{n_{c}}$, where $y_{i}$ stands for the corrected code and $\mathbb{S}$ indicates that the code executes successfully.

  • Run-Time: The relative average compilation time for Huawei's internal HDL compiler to execute all test code files. The Run-Time of the results from our HDLdebugger is set as the base unit.

  • Edit-Distance: Edit-Distance calculates the minimum number of operations (insertion, deletion, substitution) required to transform one code snippet into the other.
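The first two metrics above reduce to simple arithmetic, sketched below. The function names and the numbers in the usage note are hypothetical and serve only to show the computation; in particular, normalizing by dividing each method's value by that of the reference method is an assumption about how the relative values are obtained.

```python
def pass_rate(successes):
    """Fraction of corrected code files that execute successfully.

    `successes` is a list of booleans, one per corrected code file.
    """
    return sum(1 for ok in successes if ok) / len(successes)


def relative_to(values, reference):
    """Express each method's value in units of the reference method's value."""
    ref = values[reference]
    return {name: value / ref for name, value in values.items()}
```

For instance, `pass_rate([True, False, True, True])` gives 0.75, and `relative_to({"ours": 2.0, "other": 3.0}, "ours")` gives `{"ours": 1.0, "other": 1.5}`.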

Also, we evaluate our code search engine in Sec. 2.2 by the hit ratio, mean average precision, and mean reciprocal rank metrics, which are formulated as follows.

  • H@K: The hit ratio for top-$K$ recommendation results on $n_{t}$ code queries is formulated as $H@K=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\frac{1}{K}\sum_{k=1}^{K}\mathbb{I}(y_{i,k},y_{i})$, where $y_{i,k}$ denotes the error type of the $k$-th retrieved buggy code for query code $b_{i}$, and the indicator function $\mathbb{I}(y_{i,k},y_{i})=1$ if $y_{i,k}=y_{i}$ and 0 otherwise.

  • MAP@K: Mean average precision (MAP) for top-$K$ results on $n_{t}$ queries is defined as $MAP@K=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\frac{1}{K}\sum_{k=1}^{K}\frac{\mathbb{I}(y_{i,k},y_{i})\cdot n(b_{i,\leq k})}{k}$, where $n(b_{i,\leq k})$ denotes the number of recommendations among the top-$k$ that have the same error label as query $b_{i}$.

  • MRR@K: Mean reciprocal rank (MRR) for top-$K$ results on $n_{t}$ queries is formulated as $MRR@K=\frac{1}{n_{t}}\sum_{i=1}^{n_{t}}\frac{1}{K}\sum_{k=1}^{K}\frac{\mathbb{I}(y_{i,k},y_{i})}{k}$.
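The three retrieval metrics can be computed in one pass over the ranked error labels, as sketched below. The code follows the formulas exactly as written above, which differ from some common textbook variants (for example, MRR here sums over every matching rank rather than only the first). The function name and the input format, a list of ranked label lists plus a list of query labels, are assumptions made for illustration.

```python
def retrieval_metrics(retrieved_labels, query_labels, K):
    """Compute H@K, MAP@K and MRR@K as defined in the text.

    retrieved_labels[i] is the ranked list of error labels retrieved for
    query i; query_labels[i] is the error label of query i.
    """
    n_t = len(query_labels)
    hit = ap = rr = 0.0
    for labels, y in zip(retrieved_labels, query_labels):
        matches = 0  # n(b_{i,<=k}): matching results seen up to rank k
        for k, y_k in enumerate(labels[:K], start=1):
            if y_k == y:  # indicator I(y_{i,k}, y_i) = 1
                matches += 1
                hit += 1.0 / K
                ap += matches / (k * K)
                rr += 1.0 / (k * K)
    return {"H@K": hit / n_t, "MAP@K": ap / n_t, "MRR@K": rr / n_t}
```

With two queries labelled `"a"` and `"b"`, ranked results `[["a", "b"], ["b", "b"]]`, and $K=2$, this returns H@K = 0.75, MAP@K = 0.75, and MRR@K = 0.625.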

Table 3. Main Results. Pass-Rate is an absolute value; higher is better. $<1\%$ denotes a Pass-Rate below 1%. Run-Time and Edit-Distance are relative values compared with ours; lower is better for both.

| Method | Pass-Rate | Run-Time | Edit-Distance |
|---|---|---|---|
| ChatGPT$^{*}$ | 3.01% | 2.25 | 31.28 |
| GPT4$^{*}$ | 6.35% | 1.94 | 18.17 |
| OpenChat | <1% | 2.51 | 34.47 |
| OpenChat w/ RAG | 3.01% | 2.21 | 10.37 |
| Orca2 | 2.68% | 2.08 | 2.94 |
| Orca2 w/ RAG | 9.03% | 2.10 | 4.74 |
| Mistral | 7.36% | 2.39 | 6.68 |
| Mistral w/ RAG | 24.75% | 2.01 | 9.88 |
| Self-debug | 5.02% | 2.49 | 10.24 |
| RTLfixer | 28.35% | 2.11 | 11.05 |
| VeriGen | <1% | 2.46 | 49.16 |
| VeriGen w/ RAG | 1.34% | 2.56 | 37.12 |
| VeriGen w/ SFT | 67.55% | 1.35 | 1.38 |
| RTLCoder | <1% | 2.49 | 43.08 |
| RTLCoder w/ RAG | <1% | 2.65 | 34.85 |
| RTLCoder w/ SFT | 64.21% | 1.53 | 3.83 |
| Deepseek | 2.34% | 2.58 | 32.17 |
| Deepseek w/ RAG | 3.34% | 2.51 | 34.38 |
| Deepseek w/ SFT | 51.63% | 1.64 | 3.35 |
| Starcoder | <1% | 2.58 | 28.85 |
| Starcoder w/ RAG | <1% | 2.56 | 34.85 |
| Starcoder w/ SFT | 68.27% | 1.21 | 1.57 |
| Stablecode | 6.69% | 2.46 | 4.56 |
| Stablecode w/ RAG | 8.02% | 2.50 | 6.25 |
| Stablecode w/ SFT | 41.47% | 2.38 | 6.10 |
| WizardCoder | 3.01% | 2.41 | 7.19 |
| WizardCoder w/ RAG | 4.68% | 2.58 | 8.67 |
| WizardCoder w/ SFT | 71.57% | 1.04 | 1.06 |
| HDLdebugger (ours) | 81.93% | 1.00 | 1.00 |

3.2. Main Results

Table 3 reports the main results of our method and the baselines. Our method outperforms all other methods on every metric, by a large margin in Pass-Rate, whether they use the direct approach, RAG, or SFT, which demonstrates the effectiveness of our framework. For both the runtime and edit distance metrics, we normalize all results and present only the values relative to ours to make the comparison clearer.

3.2.1. Comparison with different types of LLMs.

In our study, we conducted a comparative evaluation of general-purpose and code LLMs (ChatGPT, GPT-4, OpenChat, Orca2, Mistral, Deepseek, Starcoder, Stablecode, and WizardCoder) and specialized debugging and hardware models (Self-debug, RTLfixer, VeriGen, and RTLCoder) within the HDL debugging scenario. Due to privacy concerns, certain code specifications have been omitted for testing purposes when using ChatGPT or GPT4. To distinguish these versions, we refer to them as ChatGPT$^{*}$ and GPT4$^{*}$. Our approach exhibits superior performance against all 13 state-of-the-art baselines, including GPT-4 and the domain-specific hardware language models.

3.2.2. Analysis of Different Strategies.

We evaluate three strategies: direct inference, retrieval-augmented generation (RAG), and supervised fine-tuning (SFT). For HDL debugging, our analysis shows that SFT is the more effective strategy across the baselines to which it is applied, namely VeriGen, RTLCoder, Deepseek, Starcoder, Stablecode, and WizardCoder. In the majority of scenarios, methods enhanced through SFT outperform those augmented with RAG by a substantial margin. In addition, our proposed method integrates both RAG and SFT and achieves the best performance, indicating that combining the two is beneficial for the HDL debugging task.

3.2.3. Impact on Domain-Specific Solutions.

In addition to general and code LLMs, methods such as Self-debug and RTLfixer are designed specifically for code debugging. While these approaches improve over other general and code LLMs, their results remain unsatisfactory. We also evaluate LLMs trained exclusively on hardware languages, namely VeriGen and RTLCoder. Contrary to expectations, these specialized hardware language models do not outperform their general and code LLM counterparts in our HDL debugging setting, which suggests a task-domain generalization issue in HDL debugging scenarios. In contrast, our approach consistently surpasses the domain-specific solutions, regardless of the prompt engineering techniques employed or the domain-specific data used for training, which underscores the effectiveness of our methodology.

3.3. Ablation Studies

First, we analyze the impact of the different strategies in our method; Table 4 reports their performance. For the baseline, we use only direct inference on the base model, i.e., CodeLlama. SFT w/ th denotes supervised fine-tuning with generated thoughts. RAG & SFT w/ th denotes retrieval-augmented LLM fine-tuning, in which retrieved code instances and relevant information are combined for fine-tuning. Table 4 shows that SFT w/ th outperforms the baseline by a large margin, and that combining RAG and SFT improves performance significantly further.

Table 4. Ablation on different strategies
Method Pass-Rate Run-Time Edit-Distance
Direct(CodeLlama) 4.01% 2.26 35.51
+ RAG 15.05% 2.20 5.34
+ SFT 70.56% 1.19 1.19
+ SFT w/ th 74.91% 1.09 1.13
+ RAG & SFT w/ th 81.93% 1.00 1.00

In addition, we evaluate our search engine for RAG against four methods: TF-IDF (Aizawa, 2003), BM25 (Robertson et al., 2004), Random Forest (Rigatti, 2017), and XGBoost (Chen and Guestrin, 2016). TF-IDF and BM25 assess relevance scores between buggy codes, while Random Forest and XGBoost generate low-dimensional representations for codes and error messages, similar to our BERT-LSTM model’s approach to computing relevance through representation similarity.

Table 5 indicates our engine surpasses all baselines in accurately retrieving and correcting buggy code queries, highlighting its superior ability to decode complex buggy code patterns beyond the capabilities of traditional and machine learning models. Traditional methods like TF-IDF and BM25 lack the depth to understand complex code bugs, while models like XGBoost and Random Forest fall short in semantic comprehension. Our engine effectively combines textual and semantic analysis, enhancing bug detection and correction.

Table 5. Evaluation of the search engine in Sec. 2.2 on top-$k$ recommendation. XGB and RF denote XGBoost and Random Forest, respectively. MAP@1 and MRR@1 are mathematically identical to H@1. The first row, Optimal, gives the best achievable value under each metric.
Method H@1 H@3 H@10 MAP@3 MAP@10 MRR@3 MRR@10
Optimal 1.00 1.00 1.00 1.00 1.00 0.61 0.29
TF-IDF 0.97 0.87 0.75 0.86 0.71 0.55 0.24
BM25 0.95 0.80 0.59 0.81 0.54 0.53 0.22
XGB 1.00 0.35 0.18 0.34 0.14 0.34 0.12
RF 0.31 0.43 0.41 0.34 0.29 0.25 0.11
Ours 1.00 0.98 0.94 0.98 0.93 0.60 0.28
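
The metric definitions are not restated in this section, so the sketch below is only one reading that is consistent with Table 5: every metric is averaged over the top-$k$ positions, which makes all three coincide at $k=1$ and gives a perfect ranking an MRR@3 of $(1 + 1/2 + 1/3)/3 \approx 0.61$ and an MRR@10 of about 0.29, matching the Optimal row. The function names and the binary relevance list `rel` are assumptions introduced for illustration.

```python
def hit_at_k(rel, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(rel[:k]) / k


def map_at_k(rel, k):
    """Precision at each relevant position, averaged over the k positions."""
    total, hits = 0.0, 0
    for i, r in enumerate(rel[:k], start=1):
        if r:
            hits += 1
            total += hits / i
    return total / k


def mrr_at_k(rel, k):
    """Reciprocal ranks of the relevant items, averaged over the k positions."""
    return sum(r / i for i, r in enumerate(rel[:k], start=1)) / k


# A perfect ranking: every one of the ten retrieved items is relevant.
perfect = [1] * 10
optimal_row = [hit_at_k(perfect, 1), hit_at_k(perfect, 3), hit_at_k(perfect, 10),
               map_at_k(perfect, 3), map_at_k(perfect, 10),
               mrr_at_k(perfect, 3), mrr_at_k(perfect, 10)]
```

Under this reading, MRR@$k$ is bounded well below 1 for $k > 1$, which is why the Optimal row is needed to interpret the last two columns.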

3.4. Parameter Sensitivity

In the following experiments, we evaluate sensitivity to several hyper-parameters, including the number of retrieved code instances and the associated inference time. We also consider the temperature and pass-rate@$k$ for various $k$.

Figure 5. Pass-rate and retrieved code instances
Figure 6. Inference time and retrieved code instances
Figure 7. Pass-rate@k and Temperature

3.4.1. The number of retrieved code instances

Figure 5 depicts the impact of the number of retrieved code instances on model performance. Performance improves with each additional instance from 1 to 5, where the most significant gains occur. Beyond five instances, the gains plateau or even reverse, indicating diminishing returns. This suggests that adding retrieved code instances enhances model performance only up to a point, and that further instances beyond this threshold are inefficient. Moreover, we observe that the language model tends to generate repetitive or redundant content derived from its input as the number of retrieved code instances increases. This phenomenon is an important area for future exploration.

3.4.2. Feasibility on Inference

We assess the additional inference budget incurred by incorporating the retrieved code instances. Figure 6 shows the normalized inference times to highlight the incremental budget required. A value of 0 indicates that no retrieved code instances are used. As the number of retrieved code instances increases, the inference time exhibits a slow, logarithmic increase rather than a linear one. This pattern underscores the efficiency of our approach: integrating retrieved code instances significantly enhances performance without proportionally increasing the inference overhead.

3.4.3. Pass-rate@k and Temperature

We explore the effects of varying the temperature and of pass-rate@$k$ for different values of $k$, where $k$ is the number of answers generated by the LLM, as discussed in (Chen et al., 2021). Typically, a lower temperature yields more deterministic outcomes, whereas higher temperatures produce more varied outputs. Figure 7 shows that overall performance increases with $k$. At a low temperature, such as 0.1, outputs are more consistent, leading to a narrower performance range. Conversely, at higher temperatures, such as 1.2, outputs become more varied, which raises the likelihood of generating a correct answer as $k$ increases. Notably, for pass-rate@5, the performance at a temperature of 1.2 surpasses that at 0.7, indicating that higher temperatures can improve outcomes, particularly at larger values of $k$.
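
For reference, Chen et al. (2021) estimate pass@$k$ from $n \ge k$ generated samples of which $c$ are correct as $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that estimator; whether the pass-rate@$k$ reported here was computed with this estimator or by directly generating $k$ answers is not stated in this section, so it should be read as an illustration of the cited metric only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples,
    drawn without replacement from n generations with c correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With a fixed number of correct samples, the estimate grows with $k$, which is in line with the trend reported for Figure 7.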

4. Related Work

4.1. Automatic Code Debugging

Automatic code debugging has emerged as a promising area within software engineering (Monperrus, 2018; Huang et al., 2023). Given buggy code, the task is to automatically repair it so that it functions correctly, which alleviates the burden of manual debugging. Classic techniques can be mainly classified as template-based (Jiang et al., 2018; Liu et al., 2019), heuristic-based (Wen et al., 2018; Yuan and Banzhaf, 2018), constraint-based (Xuan et al., 2016; Xiong et al., 2017), and neural network-based approaches (Fu et al., 2022; Chen et al., 2022; Zhang et al., 2021). Specifically, template-based approaches apply expert-defined code patterns to fix bugs. These approaches can only repair code that matches specific patterns and do not generalize to other bugs. Heuristic-based approaches apply predefined heuristics and cannot cover all types of bugs. Constraint-based approaches repair buggy code by solving a constraint problem. These methods can be accurate, but they are computationally expensive. Neural network-based approaches need large numbers of high-quality labeled training pairs (i.e., pairs of buggy code and fixed code) to optimize their parameters, and collecting such pairs is time-consuming.

Recently, large language models (LLMs) have shed new light on automatic code debugging. The prevailing hypothesis suggests that LLMs, through training on extensive repositories of open-source code snippets, are adept at identifying bug patterns and facilitating the repair of defective code. Contemporary strategies employing LLMs predominantly utilize retrieval augmented generation (RAG) and sophisticated prompt engineering techniques to address debugging challenges. Notably, Self-debug (Chen et al., 2023) represents a pioneering effort in applying LLMs to code debugging, employing targeted prompt engineering for enhanced effectiveness. RTLfixer (Tsai et al., 2023) utilizes both RAG and prompt engineering to tackle the HDL debugging problem. However, these methods do not show satisfactory results in our industry-level cases due to the lack of requisite knowledge of HDL codes. Fine-tuning with HDL code resources is one alternative to tackle the problem. Nevertheless, these approaches need abundant and high-quality labeled data, which is not suitable for HDL codes, since the related HDL codes are limited due to privacy and commercial issues.

4.2. Large Language Models for Code Generation

Large language models (LLMs) have transformed the landscape of code generation by leveraging vast amounts of code data to predict and generate syntactically and semantically correct code snippets (Wang et al., 2023b). Notable among these models is OpenAI’s Codex (Chen et al., 2021), which powers GitHub Copilot, offering context-aware code suggestions and completions to developers directly within their IDEs. Another key contribution is DeepMind’s AlphaCode (Li et al., 2022), which generates code solutions for competitive programming challenges and reaches a competitive ranking among human participants in coding competitions. Recent LLMs tailored for the coding domain include DeepSeek (Bi et al., 2024), Starcoder (Li et al., 2023), Stablecode (Rombach et al., 2022), Codellama (Roziere et al., 2023), and WizardCoder (Luo et al., 2023). These models have been trained on extensive datasets, both proprietary and open-source, to enhance capabilities in code generation, completion, and debugging. Within the realm of hardware description languages (HDLs), ChipNeMo (Liu et al., 2023a) represents a pioneering effort in developing a domain-specific LLM, highlighting the challenges and potential of applying LLMs to the hardware domain. Despite being trained on vast hardware-specific datasets, ChipNeMo achieves performance that, while competitive, does not surpass that of state-of-the-art general LLMs such as GPT-4. This underscores the inherent complexities of adapting LLMs to the nuances of HDL. Concurrently, initiatives like VeriGen (Thakur et al., 2023) and RTLCoder (Liu et al., 2023b), which focus on fine-tuning LLMs using specialized datasets, have demonstrated remarkable results. These developments underscore the evolving landscape of LLM applications in HDL debugging, highlighting both achievements and areas ripe for further exploration.

5. Conclusion

In this paper, we propose an LLM-assisted HDL debugging framework, namely HDLdebugger, which consists of HDL debugging data generation, a search engine, and a retrieval-augmented LLM fine-tuning approach. Through extensive and varied experimentation with multiple LLMs, we have unearthed pivotal findings within the domain. Our method significantly surpasses existing techniques, achieving an exceptional pass-rate of up to 81.93%, indicating that HDLdebugger could automate and streamline HDL debugging for chip design. In addition, we provide in-depth experimental analysis and outline potential future directions for HDL debugging with LLMs in Appendix D.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Aleman, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Aizawa (2003) Akiko Aizawa. 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management 39, 1 (2003), 45–65.
  • Bi et al. (2024) Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, et al. 2024. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954 (2024).
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
  • Chen et al. (2023) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
  • Chen et al. (2022) Zimin Chen, Steve Kommrusch, and Martin Monperrus. 2022. Neural transfer learning for repairing security vulnerabilities in c code. IEEE Transactions on Software Engineering 49, 1 (2022), 147–165.
  • Cong et al. (2011) Jason Cong, Bin Liu, Stephen Neuendorffer, Juanjo Noguera, Kees Vissers, and Zhiru Zhang. 2011. High-level synthesis for FPGAs: From prototyping to deployment. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30, 4 (2011), 473–491.
  • Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
  • Fu et al. (2022) Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Dinh Phung. 2022. VulRepair: a T5-based automated software vulnerability repair. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 935–947.
  • Gao et al. (2023) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).
  • Gordon (1995) Mike Gordon. 1995. The semantic challenge of Verilog HDL. In Proceedings of tenth annual IEEE symposium on logic in computer science. IEEE, 136–145.
  • Hochbaum (1996) Dorit S Hochbaum. 1996. Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems. In Approximation algorithms for NP-hard problems. 94–143.
  • Huang et al. (2023) Kai Huang, Zhengzi Xu, Su Yang, Hongyu Sun, Xuejun Li, Zheng Yan, and Yuqing Zhang. 2023. A Survey on Automated Program Repair Techniques. arXiv preprint arXiv:2303.18184 (2023).
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825 (2023).
  • Jiang et al. (2018) Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping program repair space with existing patches and similar code. In Proceedings of the 27th ACM SIGSOFT international symposium on software testing and analysis. 298–309.
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
  • Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
  • Liu et al. (2019) Kui Liu, Anil Koyuncu, et al. 2019. TBar: Revisiting template-based automated program repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 31–42.
  • Liu et al. (2023a) Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, et al. 2023a. Chipnemo: Domain-adapted llms for chip design. arXiv preprint arXiv:2311.00176 (2023).
  • Liu et al. (2023b) Shang Liu, Wenji Fang, Yao Lu, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. 2023b. Rtlcoder: Outperforming gpt-3.5 in design rtl generation with our open-source dataset and lightweight solution. arXiv preprint arXiv:2312.08617 (2023).
  • Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, et al. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
  • Mitra et al. (2023) Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, et al. 2023. Orca 2: Teaching small language models how to reason. arXiv preprint arXiv:2311.11045 (2023).
  • Monperrus (2018) Martin Monperrus. 2018. Automatic software repair: A bibliography. ACM Computing Surveys (CSUR) 51, 1 (2018), 1–24.
  • Navarro (2001) Gonzalo Navarro. 2001. A guided tour to approximate string matching. ACM computing surveys (CSUR) 33, 1 (2001), 31–88.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
  • Rigatti (2017) Steven J Rigatti. 2017. Random forest. Journal of Insurance Medicine 47, 1 (2017), 31–39.
  • Robertson et al. (2004) Stephen Robertson, Hugo Zaragoza, and Michael Taylor. 2004. Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management. 42–49.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
  • Thakur et al. (2023) Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2023. Verigen: A large language model for verilog code generation. arXiv preprint arXiv:2308.00708 (2023).
  • Tsai et al. (2023) YunDa Tsai, Mingjie Liu, and Haoxing Ren. 2023. RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Models. arXiv preprint arXiv:2311.16543 (2023).
  • Tsourakakis (2015) Charalampos Tsourakakis. 2015. The k-clique densest subgraph problem. In Proceedings of the 24th international conference on world wide web. 1122–1132.
  • Wang et al. (2023a) Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. 2023a. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235 (2023).
  • Wang et al. (2023b) Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2023b. Software testing with large language model: Survey, landscape, and vision. arXiv preprint arXiv:2307.07221 (2023).
  • Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560 (2022).
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
  • Wen et al. (2018) Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, et al. 2018. Context-aware patch generation for better automated program repair. In Proceedings of the 40th international conference on software engineering. 1–11.
  • Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).
  • Xiong et al. (2017) Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise condition synthesis for program repair. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 416–426.
  • Xuan et al. (2016) Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clement, Sebastian Lamelas Marcote, Thomas Durieux, Daniel Le Berre, and Martin Monperrus. 2016. Nopol: Automatic repair of conditional statement bugs in java programs. IEEE Transactions on Software Engineering 43, 1 (2016), 34–55.
  • Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 (2023).
  • Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
  • Yuan and Banzhaf (2018) Yuan Yuan and Wolfgang Banzhaf. 2018. Arja: Automated repair of java programs via multi-objective genetic programming. IEEE Transactions on software engineering 46, 10 (2018), 1040–1067.
  • Zhang et al. (2015) Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA international symposium on field-programmable gate arrays. 161–170.
  • Zhang et al. (2021) Xiaoyu Zhang, Juan Zhai, Shiqing Ma, and Chao Shen. 2021. Autotrainer: An automatic dnn training problem detection and repair system. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 359–371.

Appendix A Additional Materials for Data Generation

Table 6 lists the common error types in HDL code, as summarized by an LLM. By locating the main cause of each error, we can design modification functions that reproduce the error in a given HDL code. Matching error causes to modification functions reveals that simple and similar modifications can result in entirely different errors. These modification functions are an essential part of the reverse engineering pipeline and form the foundation of the solutions in the RAG search engine.

The error message and solution database of the RAG search engine is built on these modification functions. Table 7 shows examples of the information in the database. For each common error code in HDL compilation, the database provides a description and the root reason of the error to aid debugging. Most importantly, it collects recommended solutions for each error code. When knowledge is retrieved with the RAG search engine, the root reason and solution are returned together with the error message to form the LLM input.

Table 6. Sample HDL error and modification functions.
Error Type | Error Cause | Modification Function
Memory | RAM connection or RAM port | Pulse signal modification
Clock | Clock unit connection and availability | Pulse signal or assignment modification
Trace | Trace unit availability and usage | Assignment modification
Compression | Compression unit definition and I/O | Register or assignment modification
Data | Data Value | Register or pulse signal modification
Test Point | Procedures completeness | All variables modification
Pattern Reform | Pattern equivalence | Assignment and probe modification
Syntax E | Sscript syntax error | Syntax modification
Netlist E | Netlist attribute | Netlist modification
Simulation E | Simulation config | Config modification
Note: The errors here are abstracted and modified for information security, but still reflect the categories and causes without revealing details.

Table 7. Examples of error information.
Error code | Descriptions | Root-Reasons | Solutions
C-error-1 | Netlist not correctly obtain defined signal | Definition not match netlist | Modify definition or netlist
C-error-2 | Not define top port as output | Signal not output | Modify top port as output
C-error-6 | Clock signal not off during ** | ** get wrong signal | Correctly off clock signal
T-error-2 | Clock definition duplicate | clock name conflict | Delete one definition
T-error-4 | Probe initialized as 0 in ** | Invalid initialization | Modify probe initialization to non-0
T-error-18 | Assignment in non-initialization stage | Wrong assignment | Delete assignment
T-error-27 | Pulse non-exist variable | wrong pulse | Delete pulse
M-error-1 | Memory close when read | Wrongly off memory | Correct memory and constraint
M-error-17 | Logical loop in netlist | Logical loop in netlist | Modify netlist
P-error-8 | Definition lack of shift | Wrong definition | Modify definition
Note: The rules here are abstracted and modified for information security, but still reflect the meanings, causes and solutions without revealing details.
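
The entries of Table 7 can be pictured as simple records keyed by error code. The sketch below is illustrative only: the `ErrorEntry` class, the `KNOWLEDGE_BASE` dictionary, the `retrieve_context` helper, and the layout of the returned text are hypothetical and are not the database schema or prompt format used by HDLdebugger. The two rows are copied from Table 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEntry:
    code: str
    description: str
    root_reason: str
    solution: str


# Two rows copied from Table 7; the storage layout itself is an assumption.
KNOWLEDGE_BASE = {
    entry.code: entry
    for entry in [
        ErrorEntry("C-error-2", "Not define top port as output",
                   "Signal not output", "Modify top port as output"),
        ErrorEntry("T-error-2", "Clock definition duplicate",
                   "clock name conflict", "Delete one definition"),
    ]
}


def retrieve_context(error_code: str) -> str:
    """Return the stored knowledge for an error code as text for the LLM input."""
    entry = KNOWLEDGE_BASE.get(error_code)
    if entry is None:
        return f"No stored knowledge for {error_code}."
    return (f"Error: {entry.code} ({entry.description})\n"
            f"Root reason: {entry.root_reason}\n"
            f"Suggested solution: {entry.solution}")
```

A call such as `retrieve_context("T-error-2")` would return the description, root reason, and suggested solution as three lines of text that can be placed alongside the buggy code in the model input.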

Appendix B Additional Materials for RAG

B.1. BERT-LSTM

The BERT-LSTM model combines the contextual embedding capabilities of BERT (Bidirectional Encoder Representations from Transformers) with the sequential processing strength of LSTM (Long Short-Term Memory) networks. Specifically, given a buggy code $b$ and its error message $m$, we first concatenate them into a single sequence $qs = [\mathrm{CLS}, b, \mathrm{SEP}, m]$. Since the buggy code and error message together are too long to process as a single input, we split the query sequence $qs$ into a set of sub-sequences $\{s_k\}_{k=1}^{|qs|/n_s}$, where each sub-sequence has $n_s$ tokens. We then feed each sub-sequence $s_k$ into BERT, which outputs a sentence-level vector $\mathbf{h}_k = \mathrm{BERT}(s_k)$ that captures the context information of the tokens in $s_k$.
After obtaining all sub-sequence embeddings $\{\mathbf{h}_k\}_{k=1}^{|qs|/n_s}$ for $\{s_k\}_{k=1}^{|qs|/n_s}$, we employ a Bi-LSTM to capture long-range dependencies and intricate patterns across sub-sequences: $\{\mathbf{h}'_k\}_{k=1}^{|qs|/n_s} = \mathrm{BiLSTM}(\{\mathbf{h}_k\}_{k=1}^{|qs|/n_s})$.
Given the sequence of Bi-LSTM outputs, we use a self-attention mechanism to generate the final representation of $qs$ as $\mathbf{z}^s = \mathrm{Self\_Attention}(\{\mathbf{h}'_k\}_{k=1}^{|qs|/n_s})$. The self-attention blocks attend to different parts of the LSTM sequence, enabling the model to focus on the most relevant features for classification. Then, we feed $\mathbf{z}^s$ to a linear layer to predict the buggy pattern probability for the buggy code query $(b, m)$.
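
The first step, building the query sequence and cutting it into sub-sequences of $n_s$ tokens, can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are whitespace-separated strings rather than BERT word pieces, the last sub-sequence is allowed to be shorter than $n_s$, and the helper names and the example code and message are hypothetical. The BERT, Bi-LSTM, and self-attention stages are not reproduced here.

```python
from math import ceil


def build_query_sequence(code_tokens, message_tokens):
    """Concatenate buggy code and error message as [CLS, b, SEP, m]."""
    return ["[CLS]"] + list(code_tokens) + ["[SEP]"] + list(message_tokens)


def split_into_subsequences(tokens, n_s):
    """Cut the query sequence into consecutive chunks of at most n_s tokens."""
    if n_s <= 0:
        raise ValueError("n_s must be positive")
    return [tokens[i:i + n_s] for i in range(0, len(tokens), n_s)]


# Hypothetical buggy line and error message, tokenized on whitespace.
code_tokens = "assign y = a & b ;".split()
message_tokens = "signal b is not declared".split()

qs = build_query_sequence(code_tokens, message_tokens)
chunks = split_into_subsequences(qs, n_s=4)
```

Each chunk would then be encoded separately, and the resulting vectors would be passed, in order, to the sequence model.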

B.2. Greedy Algorithm

As introduced in Sec. 2.2, the top-$k$ buggy code problem in Eq. (2) is NP-hard, so the optimal top-$k$ buggy code instances $D^q_c$ cannot be obtained in polynomial time unless P = NP. Thus, we propose a greedy algorithm in Alg. 1 with a theoretical guarantee to optimize the re-rank objective in Eq. (2). For clarity, we denote the objective in Eq. (2) by $S(D^q_c)$ as follows.

$$S(D^q_c) = \max_{D^q_c \subseteq \hat{D}^q_c} \sum_{I_i \in D^q_c} sim(q, I_i) + \frac{1}{k} \cdot \sum_{I_i \in D^q_c} dis(I_i, D^q_c)$$

The basic idea of Alg. 1 is to greedily select the buggy code instance that brings the maximum information gain to the selected instance set $D^q_c$ until the set contains $k$ instances. Specifically, given the instance set $D^q_c$, we first define the marginal information gain of $I_j$ as follows:

$$\triangle S(I_j \mid D^q_c) = S(I_j \cup D^q_c) - S(D^q_c) \tag{9}$$

As illustrated in Alg. 1, we first initialize the instance set $D^q_c$ as $\emptyset$ (line 1). Then, we compute the similarity score $sim(q, I_j)$ for each $I_j \in \hat{D}^q_c$ (lines 2-4). Next, we compute the marginal information gain of each $I_j \in \hat{D}^q_c$ and select the instance $I^*$ with the maximum $\triangle S(I_j \mid D^q_c)$ following Eq. (9) (lines 6-9). Then, we add $I^*$ into $D^q_c$ and remove it from $\hat{D}^q_c$ (lines 10-11). We repeat the selection procedure until we have selected $k$ buggy code instances (lines 5-12).
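A minimal Python sketch of the greedy selection described above. The names `greedy_top_k`, `query_sim`, and `pair_sim` are hypothetical, the similarity values are toy numbers, and treating $dis(I, \emptyset)$ as 0 is an assumption because the text does not define the diversity term for an empty set. The marginal gain is computed with the closed form $sim(q, I) + \frac{1}{k} dis(I, D^q_c)$ used in the proof of Theorem B.1.

```python
def dis(candidate, selected, pair_sim):
    """Diversity term: min over selected of (2 - sim); 0 for an empty set (assumption)."""
    if not selected:
        return 0.0
    return min(2.0 - pair_sim(candidate, other) for other in selected)

def greedy_top_k(query_sim, pair_sim, candidates, k):
    """Greedily pick up to k instances by marginal gain sim(q, I) + dis(I, D) / k."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: query_sim[c] + dis(c, selected, pair_sim) / k)
        remaining.remove(best)   # remove I* from the candidate pool
        selected.append(best)    # add I* to the selected set
    return selected

# Toy data: "a" and "b" are near-duplicates, "c" is less relevant but different.
query_sim = {"a": 0.9, "b": 0.85, "c": 0.5}
pairs = {
    frozenset(("a", "b")): 1.9,
    frozenset(("a", "c")): 0.2,
    frozenset(("b", "c")): 0.3,
}

def pair_sim(x, y):
    return pairs[frozenset((x, y))]

top2 = greedy_top_k(query_sim, pair_sim, ["a", "b", "c"], k=2)
```

With these toy values the second pick is "c" rather than the more query-similar "b", because "b" is almost identical to the already selected "a" and so contributes little diversity.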

Theorem B.1.

Alg. 1 can achieve a $1 - 1/e$ approximation ratio.

Proof.

The objective $S(D^q_c)$ in Eq. (2) is monotone and submodular.

  • Monotone: Given any $I_i \in \hat{D}^q_c$ and a selected instance set $D^q_c$, we can obtain $S(D^q_c \cup I_i) - S(D^q_c) \geq 0$. Thus, $S(D^q_c)$ is monotonically increasing.

  • Submodularity: Given $\tilde{D}^q_c \subset D^q_c$ and a buggy code instance $I_i \notin \tilde{D}^q_c, D^q_c$, we can obtain:

    $$S(\tilde{D}^q_c \cup I_i) - S(\tilde{D}^q_c) = sim(q, I_i) + \frac{1}{k} dis(I_i, \tilde{D}^q_c) \tag{10}$$
    $$S(D^q_c \cup I_i) - S(D^q_c) = sim(q, I_i) + \frac{1}{k} dis(I_i, D^q_c) \tag{11}$$

    Since $dis(I_i, \tilde{D}^q_c) = \min_{I_j \in \tilde{D}^q_c}(2 - sim(I_i, I_j))$ and $\tilde{D}^q_c \subset D^q_c$, the minimum over the larger set $D^q_c$ cannot exceed the minimum over $\tilde{D}^q_c$, i.e., $dis(I_i, D^q_c) \leq dis(I_i, \tilde{D}^q_c)$. Thus, we can obtain the inequality:

    $$S(\tilde{D}^q_c \cup I_i) - S(\tilde{D}^q_c) \geq S(D^q_c \cup I_i) - S(D^q_c).$$

    This demonstrates that $S(D^q_c)$ in Eq. (2) is submodular.

Since $S(D^q_c)$ is monotone and submodular, according to (Hochbaum, 1996), the approximation ratio of Alg. 1 is $1 - 1/e$. ∎
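The diminishing-returns inequality used in the proof can be checked numerically. The sketch below is only a sanity check on random toy similarities, not a proof. It assumes the marginal gain has the closed form given in Eqs. (10) and (11), and the variable names and value ranges are hypothetical.

```python
import random

def dis(candidate, selected, sim):
    """dis(I, D) = min over I_j in D of (2 - sim(I, I_j))."""
    return min(2.0 - sim[(candidate, other)] for other in selected)

rng = random.Random(0)  # fixed seed so the check is repeatable
k = 5
violations = 0
for _ in range(1000):
    # Hypothetical similarities between a new instance "x" and four selected ones.
    sim = {("x", j): rng.uniform(-1.0, 1.0) for j in range(4)}
    q_sim = rng.uniform(-1.0, 1.0)
    small, large = [0, 1], [0, 1, 2, 3]  # small is a strict subset of large
    gain_small = q_sim + dis("x", small, sim) / k
    gain_large = q_sim + dis("x", large, sim) / k
    if gain_small < gain_large:
        violations += 1
```

The gain with respect to the smaller set is never below the gain with respect to the larger set, because a minimum taken over more elements can only stay the same or decrease.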

Time Complexity. Assume the dimensions of $\mathbf{z}^w$ and $\mathbf{z}^s$ are $d_w$ and $d_s$, respectively. First, it takes $O(N(d_w + d_s))$ to compute the similarity score $sim(q, I_j)$ for each $I_j \in \hat{D}^q_c$ (lines 2-4). Then, it takes $O(k^2(d_w + d_s))$ to select the top-$k$ buggy code instances from $\hat{D}^q_c$ (lines 5-12). Thus, the time complexity of the top-$k$ buggy code selection algorithm is $O((N + k^2)(d_w + d_s))$ in total.

Appendix C Retrieval-augmented LLM Fine-tuning

C.1. Implementation Details

We implement our approach in PyTorch (Paszke et al., 2019) and fine-tune CodeLlama (Roziere et al., 2023), which is provided by the Hugging Face (Wolf et al., 2019) model zoo. Our code builds on the FastChat and Alpaca frameworks and incorporates FlashAttention (Dao et al., 2022) and DeepSpeed (Rasley et al., 2020) to improve efficiency during both training and inference. Our experimental setup uses eight NVIDIA A100 GPUs with 80 GB of memory each to ensure enough computational capacity. For training, we primarily adhere to the default hyperparameters. During the inference stage, we employ a greedy decoding strategy, akin to the approach used in ChipNeMo (Liu et al., 2023a), to mitigate the significant compilation costs associated with this process.

Algorithm 1 Top-$k$ relevant code selection
Input: Buggy code query $q = (b, e)$, $N$ code instances $\hat{D}^q_c$, and parameter $k$.
Output: Top-$k$ instances $D^q_c \subseteq \hat{D}^q_c$.
1: Initialize: $D^q_c \leftarrow \emptyset$.
2: for $I_j \in \hat{D}^q_c$ do
3:     $sim(q, I_j) \leftarrow$ Eq. (1)
4: end for
5: for $i = 1$ to $k$ do
6:     for $I_j \in \hat{D}^q_c$ do
7:         $\triangle S(I_j \mid D^q_c) = S(I_j \cup D^q_c) - S(D^q_c)$
8:     end for
9:     $I^* = \arg\max_{I_j \in \hat{D}^q_c} \triangle S(I_j \mid D^q_c)$
10:     $\hat{D}^q_c = \hat{D}^q_c \setminus I^*$
11:     $D^q_c = D^q_c \cup I^*$
12: end for
13: return $D^q_c$.

Algorithm 2 Thoughts Generation Flow
Input: Thoughts generation prompt $p_t$, correct response prompt $p_c$, buggy code $b$, correct code $c$, error message $m$, retrieved information $rag^d$, large language model $LLM$, number of thoughts to generate $k$.
Output: Selected thought $t^*_j$.
1: Initialize: $G \leftarrow \{\}$, $C \leftarrow \{\}$, $D \leftarrow \{\}$.
2: for $j = 1$ to $k$ do
3:     $t_j \leftarrow LLM(p_t \circ b \circ c \circ m \circ rag^d)$ // Generate thoughts.
4:     $G \leftarrow G \cup t_j$ // Append $t_j$ to $G$.
5: end for
6: for each $t_j$ in $G$ do
7:     $c_j \leftarrow LLM(p_c \circ b \circ m \circ rag^d \circ t_j)$ // Predicted correct script.
8:     $C \leftarrow C \cup c_j$ // Append $c_j$ to $C$.
9: end for
10: for each $c_j$ in $C$ do
11:     Calculate edit distance $d_j \leftarrow \text{EditDistance}(c, c_j)$
12:     $D \leftarrow D \cup d_j$ // Append $d_j$ to $D$.
13: end for
14: Sort $D$ and obtain the thought $t^*_j$ with the lowest edit distance.
15: return $t^*_j$.
Figure 8. A case of HDL debugging.

C.2. Thoughts Generation

Algorithm 2 illustrates the flow of thoughts generation. For simplicity, we omit the sample index. The process consists of four major steps. First, we concatenate the thoughts generation prompt $p_t$, buggy code $b$, correct code $c$, error message $m$, and relevant information $rag^d$, and pass the result to the LLM for inference, which yields the generated thoughts $t_1 \dots t_k$. Second, after obtaining the thoughts for the erroneous code, we concatenate the correct response prompt $p_c$, buggy code $b$, error message $m$, relevant information $rag^d$, and a generated thought $t_j$ to construct the LLM input. The inference results are the predicted correct code scripts $c_1 \dots c_k$ for correcting the given erroneous code. These two steps produce the candidate corrections; the remaining steps find the best-suited correction for the given erroneous code. Third, for the predicted correct code scripts $c_1 \dots c_k$, we compute the edit distances $d_1 \dots d_k$ from each script to the original correct code $c$. Finally, we sort the edit distances and retrieve the generated thought $t^*_j$ corresponding to the lowest edit distance $d_j$. This $t^*_j$ serves as the thought output of the flow.
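The four steps can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: `llm` is any callable that maps a prompt string to a sampled completion, the concatenation operator is approximated by joining with newlines, Levenshtein distance stands in for the unspecified edit distance, and the stub model with canned outputs exists only to make the example repeatable.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def select_thought(llm, p_t, p_c, buggy, correct, message, rag, k):
    """Generate k thoughts, replay each one without the reference fix,
    and keep the thought whose predicted code is closest to the reference."""
    thoughts = [llm("\n".join([p_t, buggy, correct, message, rag])) for _ in range(k)]
    predictions = [llm("\n".join([p_c, buggy, message, rag, t])) for t in thoughts]
    distances = [edit_distance(correct, c) for c in predictions]
    best = min(range(k), key=distances.__getitem__)
    return thoughts[best]

def make_stub_llm():
    """Hypothetical stand-in for a sampled LLM: returns canned outputs in order."""
    outputs = iter([
        "rename the wire",   # thought 1
        "add a semicolon",   # thought 2
        "assign a = b",      # code predicted from thought 1
        "assign a = b;",     # code predicted from thought 2
    ])
    return lambda prompt: next(outputs)

best_thought = select_thought(
    make_stub_llm(),
    p_t="Explain how to fix the bug.",
    p_c="Fix the bug using the hint.",
    buggy="assign a = b",
    correct="assign a = b;",
    message="syntax error",
    rag="(retrieved context)",
    k=2,
)
```

The reference code is shown to the model only when generating thoughts, never when predicting the fix, so a thought is kept only if it is informative enough to reproduce the reference on its own.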

We now delve deeper into the process of self-guided thought generation. The core principle of our approach involves submitting both the buggy code and its corrected version to the LLM, subsequently prompting the model to generate instructive guidance. We denote the correct code as $c_i$ and thoughts as $t_i$. Generally, computing $p(c_i \mid t_i)$ proves challenging due to the inaccessibility of suitable thoughts. By leveraging Bayes' rule, we can reparameterize the formulation as $p(c_i \mid t_i) \propto p(t_i \mid c_i)\, p(c_i)$. By focusing on $p(t_i \mid c_i)$, we encourage the LLM to align its outputs with verified correct code.
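For reference, the proportionality can be written out in one step. The derivation below assumes $p(t_i) > 0$ and treats the thought $t_i$ as fixed, so the normalizer is the only term dropped:

```latex
\begin{aligned}
p(c_i \mid t_i) &= \frac{p(t_i \mid c_i)\, p(c_i)}{p(t_i)} \\
                &\propto p(t_i \mid c_i)\, p(c_i)
                \qquad \text{since } p(t_i) \text{ does not depend on } c_i .
\end{aligned}
```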

C.3. Temperature setting

By varying the temperature, we can generate multiple distinct thoughts. We use multinomial sampling to generate samples randomly. Given a sequence of input tokens $\mathbf{x} = \{x_1, x_2, \dots, x_n\}$, denote by $\pi_\theta$ the LLM without the final softmax layer. The next token $x_{n+1}$ can be obtained by:

$$\begin{aligned}
out &= \pi_\theta(\mathbf{x}) \\
pr &= softmax_T(out) \\
x_{n+1} &\sim Categorical(pr),
\end{aligned} \tag{12}$$

where $out$ is the next-token logits output. $softmax_T$ denotes the softmax function with temperature, where the probability $pr_i$ is $\frac{e^{out_i / T}}{\sum_j e^{out_j / T}}$. $T$ is the temperature parameter, where a higher $T$ makes the output distribution more uniform, thus introducing more randomness. $Categorical$ denotes the categorical distribution. For example, if the next-token probability distribution is $\{dog: 0.4, cat: 0.5, bike: 0.1\}$, then the next token is selected according to these probabilities. We can generate $K$ thoughts by this sampling strategy. Therefore, we can generate thoughts randomly by controlling the temperature parameter.
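A short Python sketch of temperature-scaled sampling, reusing the dog/cat/bike example. The function names are hypothetical, and the logits are chosen as log-probabilities so that $T = 1$ reproduces the distribution in the text. It illustrates Eq. (12) only and is not the decoding code used in the paper.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / T); T must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(tokens, logits, temperature, rng):
    """Draw one token from Categorical(softmax_T(logits))."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["dog", "cat", "bike"]
# Logits chosen so that T = 1 gives the probabilities 0.4, 0.5, 0.1 from the text.
logits = [math.log(0.4), math.log(0.5), math.log(0.1)]

base = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.5)  # lower T concentrates mass on "cat"
flat = softmax_with_temperature(logits, 2.0)   # higher T moves toward uniform

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
samples = [sample_next_token(tokens, logits, 1.0, rng) for _ in range(5)]
```

Lowering $T$ raises the probability of the most likely token, while raising $T$ shifts mass toward unlikely tokens such as "bike", which is what makes repeated sampling yield different thoughts.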

C.4. Debugging example

Figure 8 illustrates an example of HDL debugging. Given an erroneous script and the related error messages, we first use our search engine to retrieve related expert solutions and similar in-context samples (5 buggy code instances by default). Then we feed the buggy code, error messages, related in-context samples, and expert solutions to our fine-tuned LLM. Finally, the LLM produces a corresponding analysis and the related correct script.
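The input assembly in this example can be sketched as a small prompt builder. Everything about the layout is an assumption made for illustration: the function name, the section titles, and the ordering are hypothetical, and the paper's actual prompt template is not reproduced here. Only the default of five retrieved instances comes from the text.

```python
def build_debug_prompt(buggy_code, error_message, retrieved_instances,
                       expert_solutions, max_instances=5):
    """Assemble one prompt string; section titles are placeholders, not the paper's."""
    parts = ["### Buggy code", buggy_code, "### Error message", error_message]
    # Keep at most max_instances retrieved buggy code instances (default 5).
    for i, instance in enumerate(retrieved_instances[:max_instances], start=1):
        parts += [f"### Similar instance {i}", instance]
    for i, solution in enumerate(expert_solutions, start=1):
        parts += [f"### Expert solution {i}", solution]
    parts.append("### Analysis and corrected script")
    return "\n".join(parts)

# Toy inputs (hypothetical content).
instances = [f"case {i}" for i in range(7)]
prompt = build_debug_prompt(
    "assign a = b",
    "syntax error near b",
    instances,
    ["End each statement with a semicolon."],
)
```

The resulting string would then be passed to the fine-tuned model, which is expected to return an analysis followed by the corrected script.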

Appendix D Analysis and Discussion

This section delves into a detailed examination of our research findings, focusing on the utility and implications of leveraging LLMs for debugging within HDL environments. We provide a thorough assessment of LLM-based debugging capabilities and the effects of iterative debugging procedures. Additionally, we explore both the challenges and prospects of incorporating LLMs into the HDL debugging framework.

D.1. Trade-offs in RAG and SFT

Typically, in a Retrieval-Augmented Generation (RAG) system, the LLM acts as an agent that remains untuned to preserve its generalization ability, which might be compromised by task-specific parameter optimization. However, to enhance its efficacy in debugging, we find it essential to fine-tune the LLM, which presents a significant challenge because LLMs are expected to address a broader array of problems. One solution is to incorporate a wider range of common instructional data during training to retain general applicability. Another strategy involves constructing a multi-LLM architecture housing numerous "expert" models to tackle specific domain challenges. Additionally, HDL debugging encompasses various tasks across different stages of electronic design automation, prompting us to consider developing a comprehensive framework in future investigations.

D.2. Iterative Debugging

Our investigation predominantly concentrates on single-round debugging, noting that existing studies, such as (Tsai et al., 2023), demonstrate that a vast majority of issues (about 90%) are resolvable in a single iteration. Nonetheless, complex scenarios necessitate multiple debugging rounds, posing a substantial challenge due to the high costs associated with gathering multi-round data. Moreover, effective iterative debugging demands LLMs capable of enhanced contextual analysis and comprehension. This area will be a focal point of our subsequent research endeavors.

D.3. Enhancing the User Experience in Debugging Systems

Within our debugging framework, we can quantitatively assess performance metrics such as pass rates and execution times. However, during the deployment phase, we observed that traditional debugging approaches, which transform erroneous scripts into correct ones, are insufficient. In the context of HDL, the emphasis extends beyond mere error correction to enhancing the quality of the implementation, given the direct impact of hardware language on chip performance. Code that successfully compiles may not necessarily optimize chip design performance. Additionally, exact solutions are not always required; engineers often benefit from suggestive "hints" that inspire solutions to complex issues. Nonetheless, developing a metric to evaluate the quality of these hints represents a significant challenge.