
Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis

Zongyue Qin*, Yunsheng Bai*, Atefeh Sohrabizadeh, Zijian Ding, Ziniu Hu, Yizhou Sun, Jason Cong
*The two authors contributed equally to this work.
Computer Science, UCLA, Los Angeles, US
{qinzongyue,yba,atefehsz,bradyd,bull,yzsun,cong}@cs.ucla.edu
Abstract

In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design in a low-level hardware description language, which is eventually synthesized into a DSA on circuits. However, creating a high-quality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as pragmas. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, which requires a deeper understanding of the program, consisting of the original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). However, existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, we propose a pre-training method based on a suite of compiler data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to 22%, and identifies designs with an average of 1.10× and 1.26× (up to 8.17× and 13.31×) performance improvement in the design space exploration (DSE) task compared to HARP and AutoDSE, respectively.

I Introduction

Over the past decades, the need for specialized computing systems to accelerate specific applications has grown, leading to the emergence of domain-specific accelerators (DSAs) like application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). Designing DSAs is challenging because it involves using hardware description languages (HDLs) at the register-transfer level (RTL) with Verilog and VHDL, which are mainly familiar to circuit designers. High-level synthesis (HLS) was introduced to address this by raising the level of abstraction to C/C++/OpenCL/SystemC, allowing designers to describe high-level behavioral representations of their designs. Despite this, HLS tools still require significant hardware design knowledge through synthesis directives in the form of pragmas, which specify computation parallelization, data caching, memory buffer partitioning, etc. These optimizations are typically done by hardware programmers and are beyond the reach of average software programmers. Our objective is to automate and accelerate the optimization of integrated circuit (IC) design, making it more accessible to software programmers.

There is a growing trend to apply machine learning to IC design automation (Huang et al., 2021a). For example, researchers have developed learning-based methods to predict the quality of HLS designs (Sohrabizadeh et al., 2022a, 2023), to explore the HLS design space intelligently for optimal resource allocation (Wu et al., 2022a), etc. These methods fundamentally rely on an informative representation of an input design for high-quality performance prediction. We, therefore, focus on the representation learning for IC designs defined with HLS C/C++ (in short, we call them HLS designs) which are annotated with compiler directives/pragmas. Specifically, we aim to design an encoder-decoder framework where the encoder provides powerful representations for the input HLS designs so that the designs’ quality can be predicted accurately.

One limitation of the existing representation learning methods for programs and HLS designs is that they usually restrict the model to using either the source code or the compiler-derived representation, but not both. For example, previous works (Sohrabizadeh et al., 2022a, 2023; Wu et al., 2022b) compile the HLS code into LLVM intermediate representation, which is then further transformed into a graph representation before a graph neural network (GNN) is used to encode it. Meanwhile, other works (Wang et al., 2021; Kanade et al., 2020; Feng et al., 2020; Guo et al., 2020) directly apply a large language model (LLM) to the source code to obtain representations that capture the semantics of general computer programs.

However, we argue that utilizing only one of the modalities is not sufficient to obtain a comprehensive program representation. On the one hand, the graph modality tends to ignore the semantic information in the source code, which is helpful for understanding a program’s behavior. For example, in a CDFG, it is difficult for a GNN to understand the functionality of a call site, particularly calls to standard libraries (e.g., glibc). What is worse, a statement such as “A[i][j] *= beta;” would be converted to a relatively large and complex subgraph in the CDFG, making it difficult for the model to understand its semantic meaning. On the other hand, two source code programs with similar semantics and functionalities could have significantly different latency and communication requirements. This is where the lower-level control-flow structure of the programs can help. Therefore, a novel model that effectively utilizes information from both modalities could be the key to generating powerful representations of HLS designs and general programs.

In this paper, we propose ProgSG (Program representation learning combining the source Sequence and the control data flow Graph) for unified representation learning that leverages both the source code modality and an enriched CDFG graph modality, with pre-training performed on both modalities. To handle the interaction between the source code and the CDFG, we propose two innovative designs in the architecture: (1) an attention-summary architecture for coarse interaction between the two modalities; (2) a fine-grained node-to-token message passing mechanism to enable further collaboration between the two modalities. We also propose a novel pre-training method based on predicting node-node relationships for compiler analysis tasks, which helps the GNN encoder address the label scarcity issue. Experimental results show that the proposed ProgSG achieves state-of-the-art performance on design quality prediction and design space exploration.

II Preliminaries

II-A HLS Design and Optimization Pragmas

The goal of this paper is to train a model that effectively predicts the quality of an HLS design, which is a C/C++ program with inserted pragmas serving as the design specification. The quality of a design is measured by its latency in cycle counts (perf) and the utilization rates of block RAM (util-BRAM), digital signal processors (util-DSP), flip-flops (util-FF), and lookup tables (util-LUT) (Sohrabizadeh et al., 2022b, a).

We specifically consider the optimization pragmas of the Merlin Compiler, an open-source tool widely used for HLS designs (https://github.com/Xilinx/merlin-compiler). The Merlin Compiler provides three types of optimization pragmas, namely PIPELINE, PARALLEL, and TILE, to define the desired microarchitecture (Sohrabizadeh et al., 2022b). As shown in the code listing in the Appendix, these pragmas can be applied at the loop level and offer control over the type of pipelining, the parallelization factor, and the amount of data caching. Table I summarizes the parameter space of these pragmas. For a given program $P$, any change in the option of any of the pragmas results in a different design $D$ with a unique microarchitecture. For example, the “fg” option in pipelining refers to the case where all the inner loops are unrolled (parallelized with separate logic) and each parallel unit is pipelined. The “cg” option, on the other hand, results in coarse-grained processing elements (PEs) that are pipelined together. For example, it can create pipelined load-compute-store units. The PARALLEL and TILE pragmas take numeric values that determine the degree of parallelization and loop tiling, respectively.

TABLE I: Target pragmas with their options.
| Pragma | Parameter Name | Parameter Space |
|---|---|---|
| PARALLEL | factor | integer |
| PIPELINE | mode | “cg”, “fg”, off |
| TILE | factor | integer |
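
To make the size of this design space concrete, the following minimal Python sketch enumerates every pragma configuration for a hypothetical kernel with two loops. The loop names and the candidate integer factors are assumptions made for illustration; Table I only states that the PARALLEL and TILE factors are integers.

```python
from itertools import product

# Parameter space following Table I. The concrete integer candidates are
# illustrative assumptions, not values taken from the paper.
PRAGMA_SPACE = {
    "PARALLEL": [1, 2, 4],
    "PIPELINE": ["off", "cg", "fg"],
    "TILE": [1, 2],
}

def enumerate_designs(loops):
    """Yield one design point per combination: {(loop, pragma): option}."""
    slots = [(loop, pragma) for loop in loops for pragma in PRAGMA_SPACE]
    choices = [PRAGMA_SPACE[pragma] for _, pragma in slots]
    for combo in product(*choices):
        yield dict(zip(slots, combo))

# Hypothetical kernel with two loops labeled "L0" and "L1".
designs = list(enumerate_designs(["L0", "L1"]))
print(len(designs))  # (3 * 3 * 2) ** 2 = 324
```

Even with these small candidate lists, two loops already yield 324 distinct designs, which suggests why exhaustive synthesis quickly becomes impractical and a learned quality predictor is attractive.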

II-B Hierarchical Graph Representation of HLS Designs 

Figure 1: An Illustration of HARP control data flow graph. Compared with a normal CDFG, it has additional block nodes and three types of edges: intra-block edges, block-flow edges, and hierarchy-level edges.

We leverage HARP’s approach (Sohrabizadeh et al., 2023) to generate the hierarchical graph representation of an HLS design, which is an enriched CDFG with extra block nodes and their connections. Figure 1 illustrates a HARP graph. Specifically, given the source code $C = (c_1, \ldots, c_I)$, where $c_i$ ($i = 1, \ldots, I$) denotes the $i$-th token of the source code, it is first transformed into an LLVM (Lattner and Adve, 2004) intermediate representation (IR), and further converted into a CDFG. (Strictly speaking, it is a modified ProGraML graph with additional call relations between instructions and explicit nodes for operands with additional pragma nodes, but for convenience and without loss of generality, we use the term “CDFG” in this paper.) Then, to insert hierarchical information into the graph, auxiliary nodes are added, where each auxiliary node represents a distinct LLVM IR block. Each of these blocks is a sequence of instructions that has a single entry point and a single exit point. Each auxiliary node has three types of edges: the edges to all instruction and data nodes within that block (intra-block edges), the edges to the previous and next blocks (block-flow edges), and the edges building connections based on the hierarchy level of the "for" loops in the C/C++ code (hierarchy-level edges). HARP (Sohrabizadeh et al., 2023) shows that the hierarchical graph representation helps propagate long-range dependency information in the graph, which helps the model learn a better graph representation.
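
The three edge types can be illustrated with a small Python sketch. This is not HARP's implementation; the function name, the input dictionaries, and the `block:` naming scheme are all hypothetical and only mirror the description above.

```python
def add_block_nodes(blocks, block_succ, loop_parent):
    """Build auxiliary block-node edges for a CDFG (illustrative sketch).

    blocks:      {block_id: [instruction/data node ids in that block]}
    block_succ:  list of (block_id, next_block_id) pairs
    loop_parent: {block_id: enclosing block_id} derived from loop nesting
    Returns a list of (src, dst, edge_type) tuples.
    """
    aux = {b: f"block:{b}" for b in blocks}  # one auxiliary node per block
    edges = []
    for b, members in blocks.items():
        for node in members:
            edges.append((aux[b], node, "intra_block"))
    for b, nxt in block_succ:
        edges.append((aux[b], aux[nxt], "block_flow"))
    for child, parent in loop_parent.items():
        edges.append((aux[parent], aux[child], "hierarchy_level"))
    return edges

# Toy example: an outer loop block B0 that encloses an inner loop block B1.
edges = add_block_nodes(
    blocks={"B0": ["i0", "i1"], "B1": ["i2"]},
    block_succ=[("B0", "B1")],
    loop_parent={"B1": "B0"},
)
for e in edges:
    print(e)
```

Because every instruction is one hop away from its block node, and block nodes are linked to each other, information can travel between distant instructions in a few message passing steps.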

III Proposed Method: ProgSG

In this section, we first describe the overall encoder-decoder architecture of ProgSG. Then, we focus on our novel encoder with a graph summary augmented sequence representation, and a fine-grained node-to-token alignment for the unification of the two modalities. Finally, we introduce a novel pre-training framework for program graphs.

III-A Overall Architecture for Design Quality Prediction

Given a design $D$ with source code $C$ and HARP graph $G$, the overall model $f(D) = f(C, G)$ first encodes the design into a set of embeddings, and then generates predictions $\hat{\bm{y}}$ with a multilayer perceptron (MLP) based decoder. Figure 2 depicts the overall diagram of our model. Let $\bm{y}$ denote the ground-truth targets (i.e., perf, util-BRAM, util-DSP, util-FF, and util-LUT). Our objective is to minimize the loss function that measures the mean squared error (MSE) between $\bm{y}$ and $\hat{\bm{y}}$, i.e., $\mathcal{L}_{\mathrm{task}} = \|\hat{\bm{y}} - \bm{y}\|^2$.

Since one modality is the source code sequence and the other is the HARP graph, it is natural to adopt a transformer model on $C$ and a GNN model on $G$. These produce token representations $\{\bm{h}_j \in \mathbb{R}^d \mid j \in \{1, \ldots, I\}\}$ via the transformer’s self-attention mechanism, and node representations $\{\bm{h}_k \in \mathbb{R}^d \mid k \in \{1, \ldots, |V|\}\}$ via the message passing mechanism, respectively, where $d$ denotes the embedding dimension. The embedding of the starting token $c_1$ is then taken as the source code summary, $\bm{h}_{\mathrm{src}} \in \mathbb{R}^d$, and a graph-level aggregation is performed on the node embeddings to serve as the graph summary, $\bm{h}_{\mathrm{graph}} \in \mathbb{R}^d$. The encoder outputs the concatenation of the two modalities’ summaries, $\mathrm{concat}(\bm{h}_{\mathrm{src}}, \bm{h}_{\mathrm{graph}})$, and lets the MLP-based decoder generate predictions.

This model serves as the foundation of our architecture. However, it solely relies on the MLP-based decoder to manage the interaction between the two modalities. We denote this simplified version of our model as ProgSG-ca.
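
As a rough illustration of the ProgSG-ca head, the sketch below concatenates two summary vectors, applies a two-layer MLP, and evaluates the squared-error loss $\mathcal{L}_{\mathrm{task}}$. It is written in plain Python for readability; the layer sizes, the ReLU activation, the random initialization, and all function names are assumptions, and the summaries are random stand-ins for real transformer and GNN outputs.

```python
import random

def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def decode(h_src, h_graph, layers):
    """Concatenate the two modality summaries, then apply an MLP."""
    z = list(h_src) + list(h_graph)          # concat(h_src, h_graph)
    for i, (weights, bias) in enumerate(layers):
        z = linear(z, weights, bias)
        if i < len(layers) - 1:
            z = [max(0.0, v) for v in z]     # ReLU between layers (assumed)
    return z

def task_loss(y_hat, y):
    """Squared L2 distance between prediction and ground truth."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y))

rng = random.Random(0)
d = 4                                        # embedding dimension (assumed)
layers = [init_layer(2 * d, 8, rng), init_layer(8, 5, rng)]  # 5 targets
h_src = [rng.random() for _ in range(d)]     # stand-in for transformer summary
h_graph = [rng.random() for _ in range(d)]   # stand-in for GNN summary
y_hat = decode(h_src, h_graph, layers)
print(len(y_hat), task_loss(y_hat, [0.0] * 5))
```

Note that the two summaries only meet inside the decoder, which is exactly the shallow interaction that the later variants aim to improve.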

III-B ProgSG-si: Graph-Summary-Augmented Sequence Representation

One limitation of the ProgSG-ca encoder is its shallow and ineffective modeling of the interaction between $C$ and $G$. We propose a novel yet simple way to address the issue, based on the following observation: the transformer operates on the sequence of tokens $C = (c_1, \ldots, c_I)$ by enabling every token to attend to every other token. That is,

$$\bm{h}_{\mathrm{src}} = \mathrm{AGG}\big(g_{\mathrm{att}}\big(\bm{h}^{(0)}_{c_1}, \ldots, \bm{h}^{(0)}_{c_I}\big)\big) \tag{1}$$

where $\mathrm{AGG}$ can be any aggregation function, $g_{\mathrm{att}}$ denotes the multi-layer self-attention encoder of a transformer model, which captures the pairwise interaction between source code tokens, $\bm{h}^{(0)}_{c_j}$ stands for the $j$-th token’s initial embedding (usually implemented by looking it up in a dictionary that maps each token ID to a $d$-dimensional learnable vector representing the initial embedding), and $\bm{h}_{\mathrm{src}}$ denotes the final program-level source code embedding.

Based on the above observation, we propose to insert the graph summary $\bm{h}_{\mathrm{graph}}$ at the beginning of the sequence, forming an augmented sequence representation $C^{(\mathrm{aug})} = (\bm{h}_{\mathrm{graph}}, c_1, \ldots, c_I)$ as input to the transformer. (This is equivalent to augmenting the initial embedding lookup dictionary with a special token initialized as the output of a GNN.) Overall,

$$\bm{h}_{\mathrm{src}} = \mathrm{AGG}\big(g_{\mathrm{att}}\big(\bm{h}_{\mathrm{graph}}, \bm{h}^{(0)}_{c_1}, \ldots, \bm{h}^{(0)}_{c_I}\big)\big). \tag{2}$$

We name such an encoder ProgSG-si (Summary Interaction), since it first runs a GNN with $L_1$ layers on $G$ to obtain a summary, and then lets the expressive transformer of $L_2$ layers handle the pairwise attention between the tokens and that summary embedding, which efficiently allows cross-modality interaction. In other words, the graph is treated as a derivative of the source code whose summary embedding is used to augment the source code sequence. During training, the gradients back-propagate through $\bm{h}_{\mathrm{graph}}$ to the GNN, updating both the GNN and the transformer.
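
The effect of prepending the graph summary can be seen in a toy version of Eq. (2). The sketch below uses a single self-attention layer with identity query, key, and value projections, and takes the output at the position of $c_1$ as the aggregation; both choices are simplifying assumptions, since the actual model uses a multi-layer pre-trained transformer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Single-head scaled dot-product self-attention, identity projections."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[t] for w, v in zip(weights, seq))
                    for t in range(d)])
    return out

def progsg_si_summary(h_graph, token_embs):
    """Prepend the graph summary, attend, and read out the c_1 position."""
    augmented = [list(h_graph)] + [list(t) for t in token_embs]  # C^(aug)
    attended = self_attention(augmented)
    return attended[1]  # index 0 is the graph summary, index 1 is c_1

tokens = [[1.0, 0.0], [0.0, 1.0]]
print(progsg_si_summary([0.0, 0.0], tokens))
print(progsg_si_summary([5.0, 5.0], tokens))
```

The two printed summaries differ even though the tokens are identical, showing that every token can now read from the graph modality through ordinary self-attention.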

Figure 2: The overall diagrams of ProgSG. “GNN”, “TF”, and “Dec” refer to Graph Neural Network Layer, Transformer Layer, and Decoder, respectively.

III-C Full Model ProgSG: Leveraging Fine-grained Node Token Interaction

While ProgSG-si enables interaction between the graph and the tokens, the graph-level summary is too coarse for the model to fully exploit the information from both modalities. Intuitively, the information exchange between the two modalities would be more effective if the interaction happened at the node/token level. A straightforward way is to apply a cross-attention module to all node embeddings $\bm{h}_{v_1}, \ldots, \bm{h}_{v_{|V|}}$ and token embeddings $\bm{h}_{c_1}, \ldots, \bm{h}_{c_I}$. However, since there could be thousands of nodes and tokens for an HLS design, the computational overhead is too high. A more efficient way to leverage fine-grained node-token interactions is therefore needed.

Recall that there are auxiliary nodes in the HARP graph that stand for the LLVM IR blocks (see Sec. II-B for more details). Meanwhile, the source code is segmented into multiple chunks so that the length of each chunk is within the input length limit of the transformer. Let $\bm{h}_{v_{a_1}}, \ldots, \bm{h}_{v_{a_N}}$ denote the embeddings of the auxiliary block nodes and let $\bm{h}_{c_{s_1}}, \ldots, \bm{h}_{c_{s_M}}$ denote the embeddings of the summary tokens in the source code chunks. Since the auxiliary block nodes in the graph modality and the chunks of source code provide an intermediate granularity between the graph/program level and the node/token level, we propose to utilize them to conduct a hierarchical node/token interaction, which is illustrated in Figure 3. The information between the two modalities is first exchanged between the block nodes and the summary tokens via the following cross-modality message passing mechanism, inspired by message passing GNNs:

$$\begin{array}{rl}
\bm{h}'_{v_{a_k}} &= \bm{h}_{v_{a_k}} + \mathrm{MLP}_2\Big(\sum_{j} \alpha_{k,j}\, \mathrm{MLP}_1\big(\bm{h}_{c_{s_j}}\big)\Big), \\
\bm{h}'_{c_{s_j}} &= \bm{h}_{c_{s_j}} + \mathrm{MLP}_4\Big(\sum_{k} \alpha_{j,k}\, \mathrm{MLP}_3\big(\bm{h}_{v_{a_k}}\big)\Big),
\end{array} \tag{3}$$

where the attention coefficients are computed via dot-product attention with learnable weight matrices $\bm{W}_1 \in \mathbb{R}^{d \times d}$ and $\bm{W}_2 \in \mathbb{R}^{d \times d}$: $\alpha_{k,j} = \mathrm{Softmax}\left(\frac{(\bm{W}_1 \bm{h}_{v_k})^{\top} (\bm{W}_2 \bm{h}_{c_j})}{\sqrt{d}}\right)$.
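
A minimal sketch of the exchange in Eq. (3) is given below. To keep it short, the four MLPs and the weight matrices $\bm{W}_1$, $\bm{W}_2$ are replaced by identity maps, and the softmax is assumed to normalize over the sending side (summary tokens for a block node, block nodes for a summary token). These are assumptions for illustration, not details confirmed by the text.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(targets, sources):
    """alpha[t][s]: softmax over sources of <h_t, h_s> / sqrt(d)."""
    d = len(sources[0])
    return [softmax([sum(a * b for a, b in zip(t, s)) / math.sqrt(d)
                     for s in sources])
            for t in targets]

def receive(targets, sources):
    """Residual update: h'_t = h_t + sum_s alpha[t][s] * h_s."""
    alpha = attend(targets, sources)
    updated = []
    for t, weights in zip(targets, alpha):
        message = [sum(w * s[i] for w, s in zip(weights, sources))
                   for i in range(len(t))]
        updated.append([ti + mi for ti, mi in zip(t, message)])
    return updated

def cross_modal_exchange(block_nodes, summary_tokens):
    """Both directions read the pre-update embeddings, as in Eq. (3)."""
    new_blocks = receive(block_nodes, summary_tokens)
    new_tokens = receive(summary_tokens, block_nodes)
    return new_blocks, new_tokens

blocks = [[0.0, 0.0], [1.0, 1.0]]   # N = 2 block-node embeddings
summaries = [[1.0, 2.0]]            # M = 1 summary-token embedding
new_blocks, new_tokens = cross_modal_exchange(blocks, summaries)
print(new_blocks)  # each block node adds the single summary token
print(new_tokens)
```

The attention here is computed over $N \times M$ pairs rather than $|V| \times I$ pairs, which is where the efficiency gain over full cross-attention comes from.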

Then, the exchanged information is propagated to each node and token via a GNN layer and a transformer layer, respectively. Specifically, for a node $v_i$ (token $c_i$) that is not a block node (summary token), let $\bm{h}'_{v_i} = \bm{h}_{v_i}$ ($\bm{h}'_{c_i} = \bm{h}_{c_i}$). The second step can then be written as

$$\begin{array}{c}
\bm{h}''_{v_i} = \bm{h}'_{v_i} + \mathrm{MLP}_6\Big(\sum_{j} \alpha_{i,j}\, \mathrm{MLP}_5\big(\bm{h}'_{v_j}\big)\Big) \\
\bm{h}''_{c_1}, \ldots, \bm{h}''_{c_I} = \mathrm{Attention}\big(\bm{h}'_{c_1}, \ldots, \bm{h}'_{c_I}\big)
\end{array} \tag{4}$$

Such cross-modality interaction enables fine-grained interaction between the two modalities so that more informative embeddings for the final prediction task can be generated. As an additional benefit, the interaction step is significantly more efficient than full cross-attention because the number of auxiliary nodes and summary tokens is usually small. To allow deep cross-modality interaction, we perform the above node-token message passing $L_2$ times, where $L_2$ is the number of transformer layers, e.g., 6 for the pre-trained CodeT5 model used in our experiments. In each of the $L_2$ layers, ProgSG runs the self-attention encoder on $C^{(\mathrm{aug})}$ and the GNN on $G$, followed by the node-token interaction.

Figure 3: Illustration of the node-token message passing mechanism. The cross-modality information is first exchanged via block nodes and block tokens. Then the information is propagated to normal nodes and tokens through the GNN and transformer layers, respectively.

III-D Pretraining GNNs for Graph Modality

Generating ground-truth targets with an HLS simulator is slow, resulting in a scarcity of labeled data. To mitigate this issue, we propose utilizing pre-training tasks. While there is extensive work on pre-training transformer models with code (Wang et al., 2021; Feng et al., 2020; Kanade et al., 2020), our focus is on pre-training GNNs for graph modality. Existing self-supervised tasks for GNNs are for general graphs instead of CDFG; thus, we propose employing data flow analyses as self-supervised learning tasks. Data flow analysis is fundamental to modern compiler technology (Cummins et al., 2021) and necessitates that GNNs extract crucial information from a program’s structure. Furthermore, these tasks can be effectively addressed by non-ML techniques, allowing us to easily obtain a substantial set of labeled data for pre-training.

In particular, we select four data flow analysis tasks: (1) reachability: whether a node can be reached from another node, (2) dominators: whether every control-flow path to an instruction node passes through another node, (3) data dependencies: whether a variable is defined in one instruction and used in another instruction, and (4) liveness: whether a variable is live-out of a statement $n$. More detailed definitions of these tasks can be found in (Cummins et al., 2021). These tasks cover a full range of forward and backward analyses, and control and data analyses. In addition, these tasks focus on predicting the relationship between two nodes in a CDFG. Such node-level tasks help the GNN learn meaningful node embeddings, which are the foundation for generating good graph embeddings. Each task can be viewed as a binary classification problem. Given a pair of nodes $v_i, v_j$ and a binary label $y_{ij}$ indicating whether the nodes have a particular relationship, we employ the cross-entropy loss as the pre-training loss.
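
Labels for such tasks can be produced without any learned model. As an illustration for the reachability task only, the sketch below derives pairwise labels with a breadth-first search and evaluates a binary cross-entropy loss on a predicted probability. The graph encoding, the function names, and the choice to label every ordered node pair are assumptions, not the procedure used in the paper.

```python
import math
from collections import deque

def reachable_from(adj, src):
    """Set of nodes reachable from src (including src) via BFS."""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def reachability_labels(adj, nodes):
    """y[(u, v)] = 1 if v can be reached from u, else 0, for all u != v."""
    labels = {}
    for u in nodes:
        reach = reachable_from(adj, u)
        for v in nodes:
            if u != v:
                labels[(u, v)] = 1 if v in reach else 0
    return labels

def binary_cross_entropy(p, y, eps=1e-12):
    """Loss for one node pair given predicted probability p and label y."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Toy control flow: a -> b -> c.
adj = {"a": ["b"], "b": ["c"], "c": []}
labels = reachability_labels(adj, ["a", "b", "c"])
print(labels[("a", "c")], labels[("c", "a")])
print(binary_cross_entropy(0.5, labels[("a", "c")]))
```

In an actual pre-training loop, the probability `p` would come from a classifier applied to the GNN embeddings of the two nodes.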

Normally, after pre-training, we would directly fine-tune the pre-trained GNN for the downstream task. However, the pre-training dataset does not contain any pragma nodes, which are important for predicting the quality of an HLS design. Therefore, we propose to use the pre-trained node embeddings as guidance to train a new (target) GNN for the downstream task. Specifically, given a graph with pragma nodes, denoted as $G$, we generate a corresponding graph without pragma nodes, denoted as $G'$. Then, for a node $v$ that appears in both $G$ and $G'$, we compute its embedding $\bm{h}_{v,G'}$ in $G'$ with the pre-trained GNN and its embedding $\bm{h}_{v,G}$ in $G$ with the GNN to be trained. We then maximize the cosine similarity between the two embeddings with the loss $\mathcal{L}_{\mathrm{guide}} = 1 - \cos\langle g_{\mathrm{cont}}(\bm{h}_{v,G}), \bm{h}_{v,G'} \rangle$, where $g_{\mathrm{cont}}$ is a continuous function (e.g., an MLP or the identity function). In this way, the target GNN learns to extract useful node-level information from the pre-trained GNN, which in turn improves the quality of the graph-level embeddings.
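
The guidance loss for a single shared node can be sketched as follows. This is a minimal pure-Python illustration with $g_{\mathrm{cont}}$ defaulting to the identity function; the function names are hypothetical, the embeddings are assumed to be non-zero vectors, and how the per-node losses are reduced over all shared nodes is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def guide_loss(h_target, h_pretrained, g_cont=lambda h: h):
    """L_guide = 1 - cos(g_cont(h_{v,G}), h_{v,G'}) for one shared node v."""
    return 1.0 - cosine_similarity(g_cont(h_target), h_pretrained)


aligned = guide_loss([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = guide_loss([1.0, 0.0], [0.0, 3.0])   # unrelated directions
opposite = guide_loss([1.0, 0.0], [-1.0, 0.0])    # opposite directions
```

The loss is 0 when the two embeddings point in the same direction and grows to 2 when they point in opposite directions, independent of their norms.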

IV Experiments

Here we present the main experiment results. Additional experiments are provided in the Appendix.

IV-A Model Hyperparameters and Training Details

During training, we combine the proposed loss functions as $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{task}} + \gamma_1 \mathcal{L}_{\mathrm{fineAlign}} + \gamma_2 \mathcal{L}_{\mathrm{coarseAlign}} + \gamma_3 \mathcal{L}_{\mathrm{guide}}$, where the $\gamma$s are hyperparameters controlling the weights of the different loss terms. During inference, we apply the encoder-decoder architecture to obtain $\hat{\bm{Y}}$.
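
The weighted combination can be written as a one-line helper. This is only a sketch with hypothetical names; the individual loss values are assumed to be already-computed scalars, and the default weights of 1 mirror the setting reported in this section.

```python
def total_loss(task, fine_align, coarse_align, guide, gammas=(1.0, 1.0, 1.0)):
    """L_total = L_task + g1 * L_fineAlign + g2 * L_coarseAlign + g3 * L_guide."""
    g1, g2, g3 = gammas
    return task + g1 * fine_align + g2 * coarse_align + g3 * guide


with_aux = total_loss(0.5, 0.1, 0.2, 0.3)
task_only = total_loss(0.5, 0.1, 0.2, 0.3, gammas=(0.0, 0.0, 0.0))
```

Setting all weights to zero recovers the plain task loss, which is a convenient way to ablate the auxiliary terms.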

We set the maximum number of tokens to 64 for the tokenizer and split each program's source code into multiple subsequences to handle long inputs. We leave the exploration of more advanced long-sequence modeling, such as Xiao et al. (2023), as future work. Since the task is at the whole-program level, for each subsequence we use the final embedding of the initial token (“[cls]”) as its summary (for ProgSG-si and ProgSG, an additional MLP projects $\bm{h}_{\mathrm{src}} = \bm{H}_{\mathrm{src}}[0:1]$ from dimension 1024 to 512), and aggregate all summaries into a final sequence-level embedding (denoted as $\bm{h}_{\mathrm{src}}$ in the main paper), which is fed into the decoder. For the two-modality models, the decoder receives the concatenation of $\bm{h}_{\mathrm{src}}$ and $\bm{h}_{\mathrm{CDFG}}$ as described in the main paper.
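
The chunk-then-aggregate scheme can be sketched as below. The aggregation operator is not stated in the text, so mean pooling is used here purely as an assumption; likewise, whether the 64-token limit includes special tokens is left open, and the helper names are hypothetical.

```python
def chunk_tokens(token_ids, max_len=64):
    """Split a long token sequence into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


def mean_pool(summaries):
    """Average per-chunk summary vectors (assumed aggregation, not confirmed)."""
    dim = len(summaries[0])
    return [sum(s[d] for s in summaries) / len(summaries) for d in range(dim)]


# A hypothetical program of 150 tokens yields chunks of 64, 64, and 22 tokens.
chunks = chunk_tokens(list(range(150)))
chunk_lengths = [len(c) for c in chunks]

# Stand-ins for the per-chunk "[cls]" embeddings produced by the encoder.
h_src = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

With an average of roughly 1,286 tokens per program (Table II), a typical program would be split into about 20 such chunks.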

The decoder consists of 6 sequentially stacked layers that project the input to a scalar. For single-modality models, the MLP decoder has layer dimensions 512-256-128-64-32-16-1. For two-modality models, the MLP decoder has layer dimensions 1024-768-512-256-128-61-1. This scheme is applied consistently to all methods for a fair comparison. Since we have 5 target metrics to predict, as mentioned in Section 4.1 of the main paper, we apply 5 MLPs to the input embeddings to produce the final $\hat{\bm{y}} \in \mathbb{R}^{5}$. We use the Exponential Linear Unit (ELU) (Clevert et al., 2015) as the activation function.
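
To make the relation between a dimension string and the layer count concrete, the following sketch turns a specification such as 512-256-128-64-32-16-1 into (input, output) shapes, one per layer. The helper name is hypothetical and the sketch does not build an actual network.

```python
def layer_shapes(spec):
    """Turn 'd0-d1-...-dn' into a list of (in_dim, out_dim) pairs, one per layer."""
    dims = [int(d) for d in spec.split("-")]
    return list(zip(dims[:-1], dims[1:]))


single_modality = layer_shapes("512-256-128-64-32-16-1")
num_layers = len(single_modality)
```

Seven dimensions give six layers, matching the 6 stacked layers described above; the first layer maps the 512-dimensional embedding to 256 and the last maps 16 to the scalar output. The same helper can be applied to the two-modality specification.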

Our framework is implemented with PyTorch, PyTorch Geometric, Transformers, etc. (we will release our code and data upon acceptance). Training is performed on a server with NVIDIA Tesla V100 GPUs. We employ the AdamW optimizer (Loshchilov and Hutter, 2019) with the initial learning rate tuned for each model using a validation set. We train with $\gamma_1 = \gamma_2 = \gamma_3 = 1$ for 1000 epochs, and the best model selected on the validation set is used for final adaptation and testing.

For the pre-trained GNN, we use a GNN with 5 transformer convolutional layers (Shi et al., 2021) as the encoder and a 2-layer MLP as the decoder for each data flow analysis task. We use a training set with 276,197 graphs and a validation set with 500 graphs to select the best pre-trained GNN. The $\beta$ in the focal loss is set to 2.

IV-B Dataset and Evaluation Protocol

For the purpose of this study, we assembled a database of medium-complexity kernels that function as fundamental building blocks for larger applications. We selected a total of 42 kernels from two well-known benchmark suites, namely, the MachSuite benchmark (Reagen et al., 2014) and the Polyhedral benchmark suite (Polybench) (Yuki and Pouchet, [n.d.]). The kernels in the database were chosen to have a broad range of computation intensities, including linear algebra operations on matrices and vectors (e.g., BLAS kernels), data mining kernels (e.g., correlation and covariance), stencil operations, encryption, and a dynamic programming application.

The database is a new version of datasets released in Bai et al. (2023), generated by the AMD/Xilinx HLS tool version 2021 to implement the design, with the AMD/Xilinx Alveo U200 as the target FPGA and a working frequency of 250MHz. For each kernel, we perform a random split with the training, validation, and testing ratio being 70:15:15. For each design point, we recorded the latency in terms of cycle counts, as well as the resource utilization for DSP, BRAM, LUT, and FF. These targets are normalized following the same procedure in Bai et al. (2023); Sohrabizadeh et al. (2022a). The statistics of the dataset are presented in Table II. The dataset will be available upon paper acceptance.
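
A per-kernel random split with a 70:15:15 ratio could look like the sketch below. The function name, the fixed seed, and the use of integer percentages are assumptions for this example; in the setup described above the split would be applied to the design points of each kernel separately.

```python
import random


def split_designs(designs, percents=(70, 15, 15), seed=0):
    """Shuffle one kernel's designs and split them into train/val/test parts."""
    items = list(designs)
    random.Random(seed).shuffle(items)
    n_train = len(items) * percents[0] // 100
    n_val = len(items) * percents[1] // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical kernel with 100 design points identified by integers.
train, val, test = split_designs(range(100))
```

Splitting within each kernel means that every kernel contributes designs to all three sets, so the test set measures generalization to unseen pragma configurations rather than to unseen kernels.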

IV-C Model Setup and Hyperparameters

We follow Sohrabizadeh et al. (2023) to generate the HARP graphs. We adopt $L_1 = 8$ layers of TransformerConv (Shi et al., 2021) with a jumping knowledge network (Xu et al., 2018) as the final node embedding aggregation method. The embedding dimension is $d = 512$. For the source code, we use CodeT5 (Wang et al., 2021) with $L_2 = 6$ layers to embed the source code (specifically, we use CodeT5-small from https://huggingface.co/Salesforce/codet5-small to initialize the transformer encoder for source code, and fine-tune the whole model). AutoDSE defines a variable for each pragma, as shown in Code 1, that is a placeholder for the option of the pragma. Since the pragmas $\zeta$ must be reflected in the input source code, for each design, we add the pragma options to their respective variables, e.g., we change “__PARA__L0” to “__PARA__L0=1”, “__PIPE__L2” to “__PIPE__L2=flatten”, etc. We set the maximum number of tokens to 64 for the tokenizer and split each program's source code into multiple subsequences to handle long inputs. The summaries of all subsequences are aggregated into the final representation for the decoder. We report the full hyperparameters in the appendix.
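
The placeholder rewriting described above can be sketched with a regular expression. The pattern below assumes placeholders of the form `__PARA__L<n>`, `__PIPE__L<n>`, and `__TILE__L<n>` as they appear in Code 1; the function name and the behavior of leaving unknown placeholders untouched are choices made for this example.

```python
import re

# Assumed placeholder shape, based on the names visible in Code 1.
PLACEHOLDER = re.compile(r"__(?:PARA|PIPE|TILE)__L\d+")


def fill_pragmas(source, options):
    """Append '=value' to every placeholder that has an entry in options."""
    def substitute(match):
        name = match.group(0)
        return f"{name}={options[name]}" if name in options else name
    return PLACEHOLDER.sub(substitute, source)


line = "#pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0}"
filled = fill_pragmas(line, {"__PARA__L0": 1, "__PIPE__L2": "flatten"})
# filled == "#pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0=1}"
```

Keeping the placeholder name next to its value lets the tokenizer see both which loop a pragma belongs to and which option was chosen for this design point.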

TABLE II: Dataset statistics. “#D”, “#P”, “A#P”, “A#T”, “A#N”, and “A#E” denote “# designs”, “# programs”, “avg # pragmas per design”, “avg # tokens per program”, “avg # nodes per program’s CDFG”, and “avg # edges per program’s CDFG”, respectively.

| Dataset | #D | #P | A#P | A#T | A#N | A#E |
| --- | --- | --- | --- | --- | --- | --- |
| Vitis 2021 | 10,868 | 40 | 8.1 | 1286.3 | 354.7 | 1246.4 |

TABLE III: Root mean square error (RMSE) of different methods in predicting target values.

| Category | Model | perf | util-LUT | util-FF | util-DSP | util-BRAM | total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Single-modality models | Code2vec | 1.0641 | 0.5462 | 0.3103 | 0.9989 | 0.1555 | 3.6150 |
| | HARP | 0.2671 | 0.1043 | 0.0565 | 0.1584 | 0.0611 | 0.6474 |
| | CodeT5 | 0.2077 | 0.0985 | 0.0619 | 0.1881 | 0.0597 | 0.6159 |
| Cross-modality models | GreaseLM | 0.2033 | 0.0805 | 0.0499 | 0.1349 | 0.0459 | 0.5146 |
| | ProgSG-ca | 0.2181 | 0.1232 | 0.0532 | 0.1381 | 0.0334 | 0.5660 |
| | ProgSG-si | 0.1591 | 0.1630 | 0.0514 | 0.1558 | 0.0335 | 0.5628 |
| | ProgSG | 0.1481 | 0.0709 | 0.0406 | 0.1084 | 0.0242 | 0.3923 |

IV-D Performance Prediction Results

We compare the prediction accuracy of ProgSG against three categories of baselines: (1) models of the source code modality, Code2vec (Alon et al., 2019) and CodeT5 (Wang et al., 2021); (2) a model of the graph modality, HARP (Sohrabizadeh et al., 2023); and (3) a model of both modalities, GreaseLM (Zhang et al., 2022). GreaseLM is designed for knowledge-graph-augmented question answering. It exchanges information between the two modalities at the program/graph level, which is too coarse for our task. We also include ProgSG-ca, which is a simple concatenation of the summary representations (Section III-A), and ProgSG-si, which combines the two modalities without fine-grained interaction (Section III-B).

Table III provides a detailed breakdown of the prediction accuracy across the target variables. Every cross-modality model achieves a lower total root mean square error (RMSE) than every single-modality model, which supports our argument for integrating multiple modalities within the model architecture. Furthermore, the comparison between the error rates of ProgSG-si and ProgSG-ca highlights the effectiveness of our graph-summary-augmented sequence representation. Moreover, ProgSG surpasses ProgSG-ca, ProgSG-si, and GreaseLM on every target. This outcome underscores the benefit of our fine-grained node-token interaction module, which enables more accurate predictions across a diverse range of target variables. In summary, our experimental results validate the effectiveness of our cross-modality program encoder.
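
For reference, the RMSE metric can be computed as in the sketch below. Treating the “total” column as the sum of the five per-target RMSEs is an inference from the numbers in Table III (it holds for most rows up to rounding) rather than something stated in the text, and the function names are hypothetical.

```python
import math


def rmse(pred, target):
    """Root mean square error between two equal-length lists of numbers."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def total_rmse(preds, targets):
    """Sum of per-target RMSEs; both arguments map target name -> list of values."""
    return sum(rmse(preds[name], targets[name]) for name in targets)


# Hypothetical predictions for two targets on two designs.
preds = {"perf": [0.0, 0.0], "util-LUT": [1.0, 2.0]}
targets = {"perf": [3.0, 4.0], "util-LUT": [1.0, 2.0]}
total = total_rmse(preds, targets)
```

Summing per-target errors treats all targets equally, which is reasonable when the targets have been normalized to comparable ranges.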

IV-E Design Space Exploration Results

In addition, we evaluate how our method performs in finding the best design for a given kernel, i.e., design space exploration (DSE). Following previous studies (Sohrabizadeh et al., 2022a, 2023), for each kernel we let each model evaluate as many design points as possible within one hour, following a heuristic order. The 10 design points with the best predicted performance are recorded. We then run an HLS simulation to obtain the ground-truth performance of the selected design points and compare them with the best design point found by running AutoDSE (Sohrabizadeh et al., 2022b) for 25 hours. We use the average speedup of the design points found by the model over those found by AutoDSE as the metric for the DSE task. Figure 4(a) shows the average and geomean of the DSE performance of HARP, CodeT5, and ProgSG. ProgSG outperforms the two single-modality baselines, indicating the benefit of our cross-modality model. In addition, CodeT5 is worse than HARP in the DSE task, even though it has a smaller RMSE in the regression task. We believe this is because inference with CodeT5 is slower than with HARP, as the GNN is much smaller. As a result, CodeT5 evaluates fewer designs within the time budget. Meanwhile, although ProgSG also suffers from slow inference, it still finds better design points thanks to its higher prediction accuracy.
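
One way to read this protocol is sketched below. It assumes that performance is measured as latency (lower is better), that the speedup for a kernel is the AutoDSE latency divided by the best measured latency among the model's top-10 designs, and that speedups are then averaged across kernels; the exact definitions used in the experiments may differ, and all names are hypothetical.

```python
import math


def best_of_top_k(predicted, measured, k=10):
    """Among the k designs with the lowest predicted latency, return the best measured latency."""
    top = sorted(predicted, key=predicted.get)[:k]
    return min(measured[d] for d in top)


def speedup(autodse_latency, model_latency):
    """Values above 1 mean the model found a faster design than AutoDSE."""
    return autodse_latency / model_latency


def mean_and_geomean(values):
    """Arithmetic and geometric mean of positive per-kernel speedups."""
    mean = sum(values) / len(values)
    geomean = math.exp(sum(math.log(v) for v in values) / len(values))
    return mean, geomean


# Hypothetical kernel with three explored designs.
predicted = {"a": 10.0, "b": 5.0, "c": 7.0}
measured = {"a": 8.0, "b": 9.0, "c": 6.0}
best = best_of_top_k(predicted, measured, k=2)
kernel_speedup = speedup(12.0, best)
```

The geometric mean is less sensitive than the arithmetic mean to a single kernel with a very large speedup, which is why both are commonly reported.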

One way to handle the slow inference speed of CodeT5 and ProgSG is to perform a two-level design space exploration. That is, HARP is first run for an hour to find the 1,000 candidate designs with the best predicted performance, and then the larger model (CodeT5 or ProgSG) selects the top 10 designs from them. This two-level approach combines the efficiency of the GNN model with the effectiveness of the larger cross-modality model. Figure 4(b) shows the DSE performance of this two-level approach. The performance of both CodeT5 and ProgSG improves significantly, showing the advantage of two-level design space exploration. However, CodeT5 still cannot outperform HARP, suggesting that an LLM by itself might not be powerful enough for our task.
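
The two-level idea is a shortlist-then-rerank pattern, sketched below. For simplicity the sketch scores every design with the fast model, whereas the procedure described above explores designs in a heuristic order under a one-hour budget; the scoring callables and all names are hypothetical, and a higher score is assumed to mean a better predicted design.

```python
def two_level_dse(designs, fast_score, slow_score, n_candidates=1000, k=10):
    """Shortlist designs with a cheap model, then re-rank the shortlist with a costly one."""
    shortlist = sorted(designs, key=fast_score, reverse=True)[:n_candidates]
    return sorted(shortlist, key=slow_score, reverse=True)[:k]


# Toy example: the fast model prefers designs near 10, the slow model prefers larger ids.
selected = two_level_dse(
    range(20),
    fast_score=lambda d: -abs(d - 10),
    slow_score=lambda d: d,
    n_candidates=5,
    k=2,
)
```

The expensive model is invoked only on the shortlist, so its cost is bounded by the number of candidates regardless of the size of the full design space.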

(a) Running each model for one hour.
(b) Running each model on 1K candidates returned by HARP.
Figure 4: Relative performance improvement of best design found by our model compared to running AutoDSE for twenty-five hours.

IV-F Effects of Pre-training

In addition to our main analysis, we conducted an ablation study on the impact of our GNN pre-training strategy on prediction accuracy. The results are presented in Table IV. While pre-training slightly reduces the accuracy of performance prediction, the RMSE for the other four targets improves by roughly 10% to 15%. The drop in performance-prediction accuracy arises because we train the model to predict multiple targets simultaneously, so the accuracy on one target may drop while the overall prediction quality improves. Overall, the total RMSE improves by 5.77%, which supports the efficacy of our pre-training approach across the prediction targets.

TABLE IV: Effects of pre-training on the prediction RMSE of our model.

| Target | w/o pre-training | with pre-training | relative impr. |
| --- | --- | --- | --- |
| perf | 0.1387 | 0.1481 | -6.8% |
| util-LUT | 0.0830 | 0.0709 | 14.6% |
| util-FF | 0.0461 | 0.0406 | 11.9% |
| util-DSP | 0.1084 | 0.1022 | 9.97% |
| util-BRAM | 0.0281 | 0.0242 | 13.9% |
| total | 0.4163 | 0.3923 | 5.77% |
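
The relative improvement column of Table IV follows from the two RMSE columns as (RMSE without pre-training minus RMSE with pre-training) divided by RMSE without pre-training. The short sketch below reproduces two of the entries; the function name is hypothetical.

```python
def relative_improvement(rmse_without, rmse_with):
    """Percentage reduction in RMSE obtained by adding pre-training."""
    return (rmse_without - rmse_with) / rmse_without * 100.0


total_impr = relative_improvement(0.4163, 0.3923)  # "total" row, about 5.77
perf_impr = relative_improvement(0.1387, 0.1481)   # "perf" row, about -6.8
```

A negative value, as in the “perf” row, means that the RMSE increased when pre-training was added.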

IV-G Attention Visualization

To better understand whether the transformer learns to attend to tokens that are relevant to HLS pragma configurations, we visualize the average attention scores of some pragma-related tokens in the Gemm-n kernel (shown in Code 1) before and after training on our regression task (Figure 5). Eleven of the 15 tokens have higher attention scores after fine-tuning. The 4 tokens with lower attention scores (i.e., “__PIPE__”, “ACCEL”, “TILE”, and “FACTOR”) often co-occur with other keywords such as “PIPELINE”, “__TILE__”, and “PARALLEL”, which makes them somewhat redundant. If we sum their attention scores with those of the tokens that co-occur with them (e.g., “__PIPE__” and “PIPELINE”), the summed attention score increases after training. These changes suggest that the transformer does learn to attend to the pragma-related tokens, which are important for predicting the quality of an HLS pragma configuration, even though these tokens are not included in its pre-training stage.

void gemm_N(double m1[4096],double m2[4096],double prod[4096])
{
  int i,j,k,k_col,i_col;
  double mult;
  #pragma ACCEL PIPELINE auto{__PIPE__L0}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L0}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0}
  for (i = 0; i < 64; i++) {
    #pragma ACCEL PIPELINE auto{__PIPE__L1}
    #pragma ACCEL TILE FACTOR=auto{__TILE__L1}
    #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L1}
    for (j = 0; j < 64; j++) {
      i_col = i * 64;
      double sum = (double )0;
      #pragma ACCEL PARALLEL reduction=sum FACTOR=auto{__PARA__L2}
      for (k = 0; k < 64; k++) {
        k_col = k * 64;
        mult = m1[i_col + k] * m2[k_col + j];
        sum += mult;
      }
      prod[i_col + j] = sum;
    }
  }
}
Code 1: Code snippet of the Gemm-Ncubed kernel with its pragmas starting with “#pragma”.
Figure 5: Bar plots of the average attention scores of pragma-related tokens before (ProgSG-rand) and after (ProgSG) being fine-tuned.

IV-H Training with Multiple Versions of Data

In addition, HARP (Sohrabizadeh et al., 2023) showed that training the model with data obtained from multiple versions of HLS tools can improve its performance. In their experiments, HARP is first trained with data from one version and then fine-tuned with data from another version. To investigate whether this conclusion holds for ProgSG, we conduct a similar experiment with data from three versions using HARP and ProgSG. Each model is first trained with data from the first version (HLS v18) for 1,000 epochs, then fine-tuned with data from the second version (HLS v20) for 200 epochs, and finally fine-tuned with data from the third version (HLS v21) for 400 epochs. Figure 6 illustrates the DSE performance of HARP and ProgSG trained with 1 version and 3 versions of data. Training ProgSG with multiple versions of data significantly increases its performance.
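
The staged schedule can be expressed as a simple loop over (version, epochs) pairs, as in the sketch below. The `train_one_epoch` callable is a hypothetical stand-in for one pass over the data of the given HLS tool version with the same model being updated throughout; optimizer and learning-rate handling between stages are not covered.

```python
# Stages as described above: train on v18, then fine-tune on v20 and v21.
SCHEDULE = [("v18", 1000), ("v20", 200), ("v21", 400)]


def run_schedule(schedule, train_one_epoch):
    """Run each stage in order; later stages continue from the earlier weights."""
    completed = []
    for version, epochs in schedule:
        for _ in range(epochs):
            train_one_epoch(version)
        completed.append((version, epochs))
    return completed


# Record which version each epoch used instead of training a real model.
epoch_log = []
completed = run_schedule(SCHEDULE, epoch_log.append)
```

Ending with the target version (v21) means the final weights are adapted to the tool version used for evaluation, while the earlier stages provide additional training signal.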

Figure 6: DSE results of HARP and ProgSG trained with 1 version (v21, denoted as 1V) and three versions (v18, v20, and v21, designated as 3V) of HLS tools.

(a) HARP embeddings
(b) CodeT5 embeddings
(c) ProgSG embeddings
Figure 7: Embedding visualizations with different methods for “Correlation” kernel using t-SNE. The color indicates the value of “perf” target.

IV-I Embedding Visualization

To gain further insight into why ProgSG outperforms CodeT5 and HARP, we visualize the embeddings of valid “correlation” kernel designs in Figure 7. The colors represent the ground-truth performance targets. All methods form distinctive clusters with similar performance within each cluster. However, HARP’s clusters are more crowded, likely due to the larger sizes of CodeT5 and ProgSG, which can better differentiate designs. Additionally, ProgSG’s clusters align more closely with performance targets, as evidenced by the closer proximity of the purple points in ProgSG’s embeddings compared to those in CodeT5’s embeddings. This suggests that ProgSG’s embeddings more accurately reflect design performance, thereby explaining its superior DSE and prediction results.

V Related Work

Machine Learning for Electronic Design Automation  Machine learning (ML) for electronic design automation (EDA) is a rising research area (Huang et al., 2021a) with applications at various stages of hardware design, such as design verification (Vasudevan et al., 2021; Xu et al., 2020; Liang et al., 2023), high-level synthesis (HLS) (Ustun et al., 2020; Sohrabizadeh et al., 2022a; Bai et al., 2022; Wu et al., 2022a; Sohrabizadeh et al., 2023; Fu et al., 2023), circuit design (Ren et al., 2020; Wang et al., 2022, 2020; Yang et al., 2022), etc. This work focuses on obtaining representations of HLS designs using information from both the source code and the CDFG for FPGA design quality regression. Many works represent the input design/circuit as graphs (Ustun et al., 2020; Ren et al., 2020; Sohrabizadeh et al., 2022a). Recently, large language models (LLMs) have been used to directly generate EDA scripts (Liu et al., 2023a, b). However, their results show that LLMs can only generate a few lines of scripts without considering the quality of the design. This work is among the first to combine both the source code and the graph modalities.

Representation Learning for Programs  Based on the modality of data, current methods can be divided into source-code-based methods and data-structure-based methods. Source-code-based methods (Kanade et al., 2020; Feng et al., 2020; Wang et al., 2021; Svyatkovskiy et al., 2020) employ language models (Devlin et al., 2018; Radford et al., 2019; Raffel et al., 2020; Zheng et al., 2023; Li et al., 2023a; Gunasekar et al., 2023; Roziere et al., 2023; Fu et al., 2024) on source code to perform various types of tasks. However, it has not been demonstrated that these language models can predict a program’s runtime, let alone the performance of the corresponding hardware design. The data-structure-based methods (Alon et al., 2019; Sohrabizadeh et al., 2022a, 2023) obtain the program embeddings from the data structure that represents a program. However, these models are usually small, which restricts their prediction ability.

Multi-modal Learning with Transformers  Modality-wise, transformers have been employed in cross-modality tasks spanning vision (Li et al., 2021; Wu et al., 2021; Huang et al., 2021b), language (Zhang et al., 2019; Lee et al., 2022), source code (Dai et al., 2022), knowledge graphs (Yasunaga et al., 2022; Rao et al., 2023), audio (Arandjelovic and Zisserman, 2017; Gan et al., 2020), point clouds (Afham et al., 2022), etc. In fact, multi-modal learning using transformers has recently been considered a possible path toward generalist artificial intelligence (Moor et al., 2023; Mai et al., 2023). More thorough surveys on graphs and transformers can be found in Liu et al. (2023c); Li et al. (2023b); Jin et al. (2023).

GreaseLM (Zhang et al., 2022) combines a GNN and a transformer for knowledge-graph-augmented QA. It is similar to our work in that it also aims to combine the graph and text modalities. However, differences in the targeted tasks lead to distinct challenges and model design choices: (1) The differences in program structures are subtle. Two designs with the same functionality can have a huge performance gap due to slight differences in the program or pragmas, so the global-level interaction in GreaseLM is not effective enough and a more fine-grained interaction between the modalities is needed. (2) Efficiency is important to our task, as a larger inference overhead means slower design space exploration. Our program-derived graph is also much larger, with thousands of nodes and tokens, so the model needs to be efficient, which rules out full cross-attention between nodes and tokens. (3) Programs are inherently hierarchical, making it possible to perform interactions at multiple levels. Given these unique challenges, we propose a novel model in which the modalities interact at both the global and block levels to maintain effectiveness while keeping it efficient. To our knowledge, we are the first to explore a cross-modality model for program representation learning.

Graph Neural Networks Pre-training   Existing self-supervised learning methods (Xie et al., 2023) can be divided into two categories: contrastive methods (Sun et al., 2019; You et al., 2020; Veličković et al., 2019; Qiu et al., 2020) and predictive methods (Xie et al., 2023; Kipf and Welling, 2016; Hu et al., 2020; Peng et al., 2020; Rong et al., 2020; Sun et al., 2020; Hu et al., 2021). To our knowledge, we are the first to explore pre-training GNNs with CDFGs.

VI Conclusion

We propose ProgSG, a novel two-modality program representation learning method for IC design (defined with HLS C/C++) optimization. The key assumption is that there is critical information in both the source code modality and the assembly code modality, which must be captured jointly. To achieve that, we propose a graph-summary-augmented sequence representation for the source code transformer, a fine-grained alignment utilization method, and a novel pre-training method for the GNN encoder for the CDFG. Experiments confirm the superiority of the proposed ProgSG over baselines. We believe the core idea of using both modalities together with their alignment is general and can be adapted for other tasks.

References

  • Afham et al. (2022) Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. 2022. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9902–9912.
  • Alon et al. (2019) Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1–29.
  • Arandjelovic and Zisserman (2017) Relja Arandjelovic and Andrew Zisserman. 2017. Look, listen and learn. In Proceedings of the IEEE international conference on computer vision. 609–617.
  • Bai et al. (2023) Yunsheng Bai, Atefeh Sohrabizadeh, Zongyue Qin, Ziniu Hu, Yizhou Sun, and Jason Cong. 2023. Towards a Comprehensive Benchmark for High-Level Synthesis Targeted to FPGAs. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  • Bai et al. (2022) Yunsheng Bai, Atefeh Sohrabizadeh, Yizhou Sun, and Jason Cong. 2022. Improving GNN-based accelerator design automation with meta learning. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 1347–1350.
  • Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015).
  • Cummins et al. (2021) Chris Cummins, Zacharias V Fisches, Tal Ben-Nun, Torsten Hoefler, Michael FP O’Boyle, and Hugh Leather. 2021. Programl: A graph-based program representation for data flow analysis and compiler optimizations. In International Conference on Machine Learning. PMLR, 2244–2253.
  • Dai et al. (2022) Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, and Shuming Shi. 2022. One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code. arXiv preprint arXiv:2205.06126 (2022).
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
  • Fu et al. (2024) Weimin Fu, Shijie Li, Yifang Zhao, Haocheng Ma, Raj Dutta, Xuan Zhang, Kaichen Yang, Yier Jin, and Xiaolong Guo. 2024. Hardware Phi-1.5 B: A Large Language Model Encodes Hardware Domain Specific Knowledge. arXiv preprint arXiv:2402.01728 (2024).
  • Fu et al. (2023) Yonggan Fu, Yongan Zhang, Zhongzhi Yu, Sixu Li, Zhifan Ye, Chaojian Li, Cheng Wan, and Yingyan Celine Lin. 2023. Gpt4aigchip: Towards next-generation ai accelerator design automation via large language models. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 1–9.
  • Gan et al. (2020) Chuang Gan, Deng Huang, Hang Zhao, Joshua B Tenenbaum, and Antonio Torralba. 2020. Music gesture for visual sound separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10478–10487.
  • Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. 2023. Textbooks Are All You Need. arXiv preprint arXiv:2306.11644 (2023).
  • Guo et al. (2020) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020).
  • Hu et al. (2020) Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Hu et al. (2021) Zhihui Hu, Guang Kou, Haoyu Zhang, Na Li, Ke Yang, and Lin Liu. 2021. Rectifying Pseudo Labels: Iterative Feature Clustering for Graph Representation Learning. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 720–729. https://doi.org/10.1145/3459637.3482469
  • Huang et al. (2021a) Guyue Huang, Jingbo Hu, Yifan He, Jialong Liu, Mingyuan Ma, Zhaoyang Shen, Juejian Wu, Yuanfan Xu, Hengrui Zhang, Kai Zhong, et al. 2021a. Machine learning for electronic design automation: A survey. ACM Transactions on Design Automation of Electronic Systems (TODAES) 26, 5 (2021), 1–46.
  • Huang et al. (2021b) Zhenyu Huang, Guocheng Niu, Xiao Liu, Wenbiao Ding, Xinyan Xiao, Hua Wu, and Xi Peng. 2021b. Learning with noisy correspondence for cross-modal matching. NeurIPS 34 (2021), 29406–29419.
  • Jin et al. (2023) Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jiawei Han. 2023. Large Language Models on Graphs: A Comprehensive Survey. arXiv preprint arXiv:2312.02783 (2023).
  • Kanade et al. (2020) Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and Evaluating Contextual Embedding of Source Code. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research), Hal Daumé III and Aarti Singh (Eds.), Vol. 119. PMLR, 5110–5121. https://proceedings.mlr.press/v119/kanade20a.html
  • Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Variational graph auto-encoders. NIPS Workshop on Bayesian Deep Learning (2016).
  • Lattner and Adve (2004) Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on CGO.
  • Lee et al. (2022) Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. 2022. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. arXiv preprint arXiv:2210.03347 (2022).
  • Li et al. (2021) Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. NeurIPS 34 (2021), 9694–9705.
  • Li et al. (2023a) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023a. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
  • Li et al. (2023b) Yuhan Li, Zhixun Li, Peisong Wang, Jia Li, Xiangguo Sun, Hong Cheng, and Jeffrey Xu Yu. 2023b. A survey of graph meets large language model: Progress and future directions. arXiv preprint arXiv:2311.12399 (2023).
  • Liang et al. (2023) Rongjian Liang, Nathaniel Pinckney, Yuji Chai, Haoxing Ren, and Brucek Khailany. 2023. Late Breaking Results: Test Selection For RTL Coverage By Unsupervised Learning From Fast Functional Simulation. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–2.
  • Liu et al. (2023c) Jiawei Liu, Cheng Yang, Zhiyuan Lu, Junze Chen, Yibo Li, Mengmei Zhang, Ting Bai, Yuan Fang, Lichao Sun, Philip S Yu, et al. 2023c. Towards graph foundation models: A survey and beyond. arXiv preprint arXiv:2310.11829 (2023).
  • Liu et al. (2023a) Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, et al. 2023a. Chipnemo: Domain-adapted llms for chip design. arXiv preprint arXiv:2311.00176 (2023).
  • Liu et al. (2023b) Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. 2023b. Verilogeval: Evaluating large language models for verilog code generation. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 1–8.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. ICLR (2019).
  • Mai et al. (2023) Gengchen Mai, Weiming Huang, Jin Sun, Suhang Song, Deepak Mishra, Ninghao Liu, Song Gao, Tianming Liu, Gao Cong, Yingjie Hu, et al. 2023. On the opportunities and challenges of foundation models for geospatial artificial intelligence. arXiv preprint arXiv:2304.06798 (2023).
  • Moor et al. (2023) Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. 2023. Foundation models for generalist medical artificial intelligence. Nature 616, 7956 (2023), 259–265.
  • Peng et al. (2020) Zhen Peng, Yixiang Dong, Minnan Luo, Xiao-Ming Wu, and Qinghua Zheng. 2020. Self-Supervised Graph Representation Learning via Global Context Prediction.
  • Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. arXiv preprint arXiv:2006.09963 (2020).
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
  • Rao et al. (2023) Jiahua Rao, Zifei Shan, Longpo Liu, Yao Zhou, and Yuedong Yang. 2023. Retrieval-based Knowledge Augmented Vision Language Pre-training. arXiv preprint arXiv:2304.13923 (2023).
  • Reagen et al. (2014) Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Gu-Yeon Wei, and David Brooks. 2014. Machsuite: Benchmarks for accelerator design and customized architectures. In IISWC.
  • Ren et al. (2020) Haoxing Ren, George F Kokai, Walker J Turner, and Ting-Sheng Ku. 2020. ParaGraph: Layout parasitics and device parameter prediction using graph neural networks. In 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
  • Rong et al. (2020) Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. 2020. Self-Supervised Graph Transformer on Large-Scale Molecular Data. Advances in Neural Information Processing Systems 33 (2020).
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  • Shi et al. (2021) Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. 2021. Masked label prediction: Unified message passing model for semi-supervised classification. IJCAI (2021).
  • Sohrabizadeh et al. (2022a) Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2022a. Automated accelerator optimization aided by graph neural networks. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 55–60.
  • Sohrabizadeh et al. (2023) Atefeh Sohrabizadeh, Yunsheng Bai, Yizhou Sun, and Jason Cong. 2023. Robust GNN-Based Representation Learning for HLS. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 1–9.
  • Sohrabizadeh et al. (2022b) Atefeh Sohrabizadeh, Cody Hao Yu, Min Gao, and Jason Cong. 2022b. AutoDSE: Enabling Software Programmers to Design Efficient FPGA Accelerators. ACM Transactions on Design Automation of Electronic Systems (TODAES) 27, 4 (2022), 1–27.
  • Sun et al. (2019) Fan-Yun Sun, Jordan Hoffman, Vikas Verma, and Jian Tang. 2019. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In International Conference on Learning Representations.
  • Sun et al. (2020) Ke Sun, Zhouchen Lin, and Zhanxing Zhu. 2020. Multi-Stage Self-Supervised Learning for Graph Convolutional Networks on Graphs with Few Labeled Nodes. In AAAI. 5892–5899.
  • Svyatkovskiy et al. (2020) Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. Intellicode compose: Code generation using transformer. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1433–1443.
  • Ustun et al. (2020) Ecenur Ustun, Chenhui Deng, Debjit Pal, Zhijing Li, and Zhiru Zhang. 2020. Accurate operation delay prediction for FPGA HLS using graph neural networks. In Proceedings of the 39th International Conference on Computer-Aided Design. 1–9.
  • Vasudevan et al. (2021) Shobha Vasudevan, Wenjie Joe Jiang, David Bieber, Rishabh Singh, C Richard Ho, Charles Sutton, et al. 2021. Learning semantic representations to verify hardware designs. NeurIPS 34 (2021), 23491–23504.
  • Veličković et al. (2019) Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep Graph Infomax. In International Conference on Learning Representations. https://openreview.net/forum?id=rklz9iAcKQ
  • Wang et al. (2020) Hanrui Wang, Kuan Wang, Jiacheng Yang, Linxiao Shen, Nan Sun, Hae-Seung Lee, and Song Han. 2020. GCN-RL circuit designer: Transferable transistor sizing with graph neural networks and reinforcement learning. In 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
  • Wang et al. (2022) Haoyu Peter Wang, Nan Wu, Hang Yang, Cong Hao, and Pan Li. 2022. Unsupervised Learning for Combinatorial Optimization with Principled Objective Relaxation. In NeurIPS.
  • Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. EMNLP (2021).
  • Wu et al. (2022a) Nan Wu, Yuan Xie, and Cong Hao. 2022a. Ironman-pro: Multi-objective design space exploration in hls via reinforcement learning and graph neural network based modeling. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2022).
  • Wu et al. (2022b) Nan Wu, Hang Yang, Yuan Xie, Pan Li, and Cong Hao. 2022b. High-level synthesis performance prediction using gnns: Benchmarking, modeling, and advancing. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 49–54.
  • Wu et al. (2021) Peng Wu, Xiangteng He, Mingqian Tang, Yiliang Lv, and Jing Liu. 2021. Hanet: Hierarchical alignment networks for video-text retrieval. In Proceedings of the 29th ACM international conference on Multimedia. 3518–3527.
  • Xiao et al. (2023) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453 (2023).
  • Xie et al. (2023) Yaochen Xie, Zhao Xu, Jingtun Zhang, Zhengyang Wang, and Shuiwang Ji. 2023. Self-Supervised Learning of Graph Neural Networks: A Unified Review. IEEE Trans. Pattern Anal. Mach. Intell. 45, 2 (2023), 2412–2429. https://doi.org/10.1109/TPAMI.2022.3170559
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. ICML (2018).
  • Xu et al. (2020) Peng Xu, Alejandro Salado, and Guangrui Xie. 2020. A reinforcement learning approach to design verification strategies of engineered systems. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 3543–3550.
  • Yang et al. (2022) Tai Yang, Guoqing He, and Peng Cao. 2022. Pre-routing path delay estimation based on transformer and residual framework. In 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 184–189.
  • Yasunaga et al. (2022) Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy S Liang, and Jure Leskovec. 2022. Deep bidirectional language-knowledge graph pretraining. Advances in Neural Information Processing Systems 35 (2022), 37309–37323.
  • You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph Contrastive Learning with Augmentations. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 5812–5823. https://proceedings.neurips.cc/paper/2020/file/3fe230348e9a12c13120749e3f9fa4cd-Paper.pdf
  • Yuki and Pouchet ([n.d.]) Tomofumi Yuki and Louis-Noël Pouchet. [n.d.]. PolyBench/C. https://web.cse.ohio-state.edu/~pouchet.2/software/polybench/
  • Zhang et al. (2022) Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. 2022. Greaselm: Graph reasoning enhanced language models. In International conference on learning representations.
  • Zhang et al. (2019) Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. 2019. ERNIE: Enhanced language representation with informative entities. ACL (2019).
  • Zheng et al. (2023) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al. 2023. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv preprint arXiv:2303.17568 (2023).

-A Additional Background in HLS Design

void kernel_mvt(double x1[120], double x2[120], double y_1[120], double y_2[120], double A[120][120]) {
  int i, j;
  #pragma ACCEL PIPELINE auto{__PIPE__L0}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L0}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0}
  for (i = 0; i < 120; i++) {
    #pragma ACCEL PARALLEL reduction = x1 FACTOR=auto{__PARA__L2}
    for (j = 0; j < 120; j++) {
      x1[i] += A[i][j] * y_1[j];
    }
  }
  #pragma ACCEL PIPELINE auto{__PIPE__L1}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L1}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L1}
  for (i = 0; i < 120; i++) {
    #pragma ACCEL PARALLEL reduction = x2 FACTOR=auto{__PARA__L3}
    for (j = 0; j < 120; j++) {
      x2[i] += A[j][i] * y_2[j];
    }
  }
}
Code 2: Code snippet of the mvt kernel (Matrix Vector Product and Transpose) with its 8 pragmas starting with “#pragma”.
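As a rough cross-check of the count given in the caption, the following sketch tallies the “auto{...}” placeholders in the pragma lines of Code 2. Each placeholder is one tunable entry of a design point. The names “MVT_PRAGMAS” and “count_occurrences” are illustrative and introduced here; they are not part of the paper's tooling.

```c
#include <stddef.h>
#include <string.h>

/* Pragma lines of the mvt kernel, copied from Code 2 (loop bodies omitted). */
const char MVT_PRAGMAS[] =
    "#pragma ACCEL PIPELINE auto{__PIPE__L0}\n"
    "#pragma ACCEL TILE FACTOR=auto{__TILE__L0}\n"
    "#pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0}\n"
    "#pragma ACCEL PARALLEL reduction = x1 FACTOR=auto{__PARA__L2}\n"
    "#pragma ACCEL PIPELINE auto{__PIPE__L1}\n"
    "#pragma ACCEL TILE FACTOR=auto{__TILE__L1}\n"
    "#pragma ACCEL PARALLEL FACTOR=auto{__PARA__L1}\n"
    "#pragma ACCEL PARALLEL reduction = x2 FACTOR=auto{__PARA__L3}\n";

/* Count non-overlapping occurrences of `needle` in `text`. */
size_t count_occurrences(const char *text, const char *needle) {
    size_t step = strlen(needle);
    size_t count = 0;
    if (step == 0) {
        return 0; /* an empty needle would never advance the search */
    }
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + step, needle)) {
        count++;
    }
    return count;
}
```

Calling count_occurrences(MVT_PRAGMAS, "auto{") yields 8, made up of two PIPELINE, two TILE, and four PARALLEL placeholders.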

-B Extra Embedding Visualization

In Figure 8, we visualize the embeddings produced by different models for the “Symm-opt-medium” kernel, where ProgSG achieves a speedup of more than eight times over HARP. As with the “Correlation” kernel, the HARP embeddings are more crowded, suggesting weaker generalization ability. Moreover, comparing the embeddings of CodeT5 and ProgSG, the yellow point (which represents the design point with the best performance) lies farther from the other points in ProgSG’s embeddings than in CodeT5’s. This suggests that ProgSG better distinguishes good design points.

Figure 8: Embedding visualizations with different methods for the “Symm-opt-medium” kernel using t-SNE: (a) HARP embeddings, (b) CodeT5 embeddings, (c) ProgSG embeddings. The color indicates the value of the “perf” target.

-C Case Studies of Best Design Points in DSE experiments

In this section, we show the design points returned by AutoDSE (25 hours), HARP (1 hour), and ProgSG (1 hour) in the DSE experiments for three kernels (“Correlation”, “Symm-opt-medium”, and “Gemver-medium”) where ProgSG outperforms HARP significantly. The source code of these kernels is shown in Code 3, Code 4, and Code 5, and the design points are listed in Table V, Table VI, and Table VII.

For the Correlation kernel, the values of “__PARA__L5” in the design points of AutoDSE and ProgSG (32 and 25) are much larger than the value in HARP’s design point (5). As a result, the design point returned by HARP leads to a sub-optimal data loading procedure, which takes up 96,000 cycles of the total latency, whereas the design point returned by ProgSG does not have this issue. For the Symm-opt-medium kernel, the design point returned by HARP has a smaller parallelization factor (“__PARA__L3” is 32 rather than 200) than the design points returned by AutoDSE and ProgSG, leading to worse efficiency. For the Gemver-medium kernel, “__PARA__L4” is 64 in the design point returned by HARP, which does not divide 400, the trip count of the corresponding for-loop. As a result, the loop takes up 120,000 cycles of the total latency, leading to worse performance. In contrast, “__PARA__L4” is 25 in the design point returned by ProgSG, which divides 400 and thus avoids the problem.
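The divisibility argument for the Gemver-medium kernel can be checked with a few lines of arithmetic. The sketch below is only illustrative: the helper names are introduced here, and how the HLS tool actually handles a factor that does not divide the trip count is not described in this paper, so the code checks only the remainder and the number of iteration groups.

```c
/* Iterations left over when a loop with `trip_count` iterations is split
 * into groups of `factor` parallel iterations. Zero means the factor
 * divides the trip count evenly. */
int leftover_iterations(int trip_count, int factor) {
    return trip_count % factor;
}

/* Number of groups needed to cover the whole loop, counting a final
 * partial group when the factor does not divide the trip count. */
int groups_needed(int trip_count, int factor) {
    return (trip_count + factor - 1) / factor;
}
```

For the 400-iteration loop, a factor of 64 (HARP) leaves 16 iterations over after 6 full groups, so a seventh partial group is needed, whereas a factor of 25 (ProgSG) gives exactly 16 full groups with nothing left over.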

void kernel_correlation(double float_n,double data[100][80],double corr[80][80],double mean[80],double stddev[80])
{
  int i;
  int j;
  int k;
  double eps = 0.1;
  #pragma ACCEL PIPELINE auto{__PIPE__L0}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L0}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0}
  for (j = 0; j < 80; j++) {
    mean[j] = 0.0;
    #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L4}
    for (i = 0; i < 100; i++) {
      mean[j] += data[i][j];
    }
    mean[j] /= float_n;
  }
  #pragma ACCEL PIPELINE auto{__PIPE__L1}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L1}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L1}
  for (j = 0; j < 80; j++) {
    stddev[j] = 0.0;
    #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L5}
    for (i = 0; i < 100; i++) {
      stddev[j] += pow(data[i][j] - mean[j],(double )2);
    }
    stddev[j] /= float_n;
    stddev[j] = sqrt(stddev[j]);
    stddev[j] = (stddev[j] <= eps?1.0 : stddev[j]);
  }
  #pragma ACCEL PIPELINE auto{__PIPE__L2}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L2}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L2}
  for (i = 0; i < 100; i++) {
    #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L6}
    for (j = 0; j < 80; j++) {
      data[i][j] -= mean[j];
      data[i][j] /= sqrt(float_n) * stddev[j];
    }
  }
  #pragma ACCEL PIPELINE auto{__PIPE__L3}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L3}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L3}
  for (i = 0; i < 80 - 1; i++) {
    corr[i][i] = 1.0;
    #pragma ACCEL PIPELINE auto{__PIPE__L7}
    for (j = i + 1; j < 80; j++) {
      corr[i][j] = 0.0;
      #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L7_0}
      for (k = 0; k < 100; k++) {
        corr[i][j] += data[k][i] * data[k][j];
      }
      corr[j][i] = corr[i][j];
    }
  }
  corr[80 - 1][80 - 1] = 1.0;
}
Code 3: Code snippet of the Correlation kernel with its pragmas starting with “#pragma”.
Parameter      AutoDSE   HARP      ProgSG
__PARA__L0     1         1         1
__PARA__L1     1         1         1
__PARA__L2     1         1         1
__PARA__L3     1         1         1
__PARA__L4     1         5         4
__PARA__L5     32        5         25
__PARA__L6     1         10        4
__PARA__L7_0   1         1         1
__PIPE__L0     fg        off       off
__PIPE__L1     off       off       off
__PIPE__L2     off       off       off
__PIPE__L3     off       off       off
__PIPE__L7     fg        fg        fg
__TILE__L0     1         1         1
__TILE__L1     1         1         1
__TILE__L2     1         1         1
__TILE__L3     1         1         1
perf           60,237    165,135   61,287
TABLE V: Best design points returned by AutoDSE, HARP, and ProgSG on “Correlation” kernel.
void kernel_symm(double alpha,double beta,double C[200][240],double A[200][200],double B[200][240])
{
  int i,j,k;
  #pragma ACCEL PIPELINE auto{__PIPE__L0}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L0}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0}
  for (i = 0; i < 200; i++) {
    #pragma ACCEL PIPELINE auto{__PIPE__L1}
    #pragma ACCEL TILE FACTOR=auto{__TILE__L1}
    #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L1}
    for (j = 0; j < 240; j++) {
      double tmp = B[i][j];
      #pragma ACCEL PARALLEL reduction=C FACTOR=auto{__PARA__L2}
      for (k = 0; k < 200; k++) {
        if (k < i) {
          C[k][j] += alpha * tmp * A[i][k];
        }
      }
      double temp2 = (double )0;
      #pragma ACCEL PARALLEL reduction=temp2 FACTOR=auto{__PARA__L3}
      for (k = 0; k < 200; k++) {
        if (k < i) {
          temp2 += B[k][j] * A[i][k];
        }
      }
      C[i][j] = beta * C[i][j] + alpha * B[i][j] * A[i][i] + alpha * temp2;
    }
  }
}
Code 4: Code snippet of the Symm-opt-medium kernel with its pragmas starting with “#pragma”.
Parameter      AutoDSE     HARP         ProgSG
__PARA__L0     1           1            1
__PARA__L1     1           1            1
__PARA__L2     25          25           25
__PARA__L3     200         32           200
__PIPE__L0     cg          off          cg
__PIPE__L1     off         cg           off
__TILE__L0     1           1            1
__TILE__L1     1           8            1
perf           4,345,927   35,536,546   4,345,927
TABLE VI: Best design points returned by AutoDSE, HARP, and ProgSG on “Symm-opt-medium” kernel.
void kernel_gemver(int n,double alpha,double beta,double A[400][400],double u1[400],double v1[400],double u2[400],double v2[400],double w[400],double x[400],double y[400],double z[400])
{
  int i,j;
  #pragma ACCEL PIPELINE auto{__PIPE__L0}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L0}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L0}
  for (i = 0; i < 400; i++) {
    #pragma ACCEL PARALLEL reduction=A FACTOR=auto{__PARA__L4}
    for (j = 0; j < 400; j++) {
      A[i][j] += + u1[i] * v1[j] + u2[i] * v2[j];
    }
  }
  #pragma ACCEL PIPELINE auto{__PIPE__L1}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L1}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L1}
  for (i = 0; i < 400; i++) {
    #pragma ACCEL PARALLEL reduction=x FACTOR=auto{__PARA__L5}
    for (j = 0; j < 400; j++) {
      x[i] += beta * A[j][i] * y[j];
    }
  }
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L2}
  for (i = 0; i < 400; i++) {
    x[i] = x[i] + z[i];
  }
  #pragma ACCEL PIPELINE auto{__PIPE__L3}
  #pragma ACCEL TILE FACTOR=auto{__TILE__L3}
  #pragma ACCEL PARALLEL FACTOR=auto{__PARA__L3}
  for (i = 0; i < 400; i++) {
    #pragma ACCEL PARALLEL reduction=w FACTOR=auto{__PARA__L6}
    for (j = 0; j < 400; j++) {
      w[i] += alpha * A[i][j] * x[j];
    }
  }
}
Code 5: Code snippet of the Gemver-medium kernel with its pragmas starting with “#pragma”.
Parameter      AutoDSE   HARP      ProgSG
__PARA__L0     1         1         1
__PARA__L1     1         1         1
__PARA__L2     1         1         1
__PARA__L3     1         8         8
__PARA__L4     2         64        25
__PARA__L5     1         10        10
__PARA__L6     25        20        25
__PIPE__L0     off       off       off
__PIPE__L1     fg        off       cg
__PIPE__L3     off       cg        off
__TILE__L0     1         1         1
__TILE__L1     1         1         1
__TILE__L3     8         1         1
perf           210,335   265,686   167,270
TABLE VII: Best design points returned by AutoDSE, HARP, and ProgSG on “Gemver-medium” kernel.
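To relate the “perf” rows of Tables V, VI, and VII to the speedups discussed in the text, the following sketch computes the ratio of a baseline's “perf” to ProgSG's. It assumes that “perf” is a latency-style metric for which a smaller value is better, which is consistent with the cycle counts quoted in the case studies. The type and function names are illustrative and introduced here.

```c
#include <stdio.h>

/* One "perf" row per kernel, copied from Tables V, VI, and VII. */
typedef struct {
    const char *kernel;
    double autodse;
    double harp;
    double progsg;
} PerfRow;

const PerfRow PERF_ROWS[3] = {
    {"Correlation",     60237.0,   165135.0,   61287.0},
    {"Symm-opt-medium", 4345927.0, 35536546.0, 4345927.0},
    {"Gemver-medium",   210335.0,  265686.0,   167270.0},
};

/* Speedup of a candidate over a baseline, assuming smaller perf is better. */
double speedup(double baseline_perf, double candidate_perf) {
    return baseline_perf / candidate_perf;
}

/* Print ProgSG's speedup over HARP and AutoDSE for each kernel. */
void print_speedups(void) {
    for (int k = 0; k < 3; k++) {
        const PerfRow *r = &PERF_ROWS[k];
        printf("%-16s vs HARP: %.2fx, vs AutoDSE: %.2fx\n", r->kernel,
               speedup(r->harp, r->progsg), speedup(r->autodse, r->progsg));
    }
}
```

Under this assumption, the ratio over HARP on Symm-opt-medium is a little above 8, matching the speedup of more than eight times mentioned in Section -B. On Correlation, the ratio over AutoDSE is slightly below 1, since AutoDSE's design point has a marginally smaller “perf” value there.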