Practical Boolean Decomposition for Delay-driven LUT Mapping
Abstract
Ashenhurst-Curtis decomposition (ACD) is a decomposition technique used, in particular, to map combinational logic into lookup-table (LUT) structures when synthesizing hardware designs. However, available implementations of ACD suffer from excessive complexity, search-space restrictions, and slow run time, which limit their applicability and scalability. This paper presents a novel, fast, and versatile ACD technique suitable for delay optimization. We use this new formulation to compute two-level decompositions into a variable number of LUTs and enhance delay-driven LUT mapping by performing ACD on the fly. Experiments with heavily optimized benchmarks show an average delay improvement of % and an area reduction of % compared to state-of-the-art LUT mapping, with affordable run time. Additionally, our method improves the best-known delay for benchmarks in the EPFL synthesis competition.
Index Terms:
Logic synthesis, Boolean decomposition, technology mapping, FPGA
I Introduction
Ashenhurst-Curtis decomposition (ACD) [1, 2], also known as Roth-Karp decomposition [3], is a powerful technique that finds a decomposition of a Boolean function into a set of sub-functions and a composition function with reduced support. ACD finds applications in logic optimization and technology mapping. Noteworthy use cases of ACD include mapping into standard cells [4] and field-programmable gate arrays (FPGAs) [5], decomposition of multi-valued relations [6], and encoding of multi-valued networks [7].
Traditional applications rely on the original formulation of ACD [1, 2, 3], breaking the input variables into two groups: the bound set (BS) and the free set (FS). Other approaches to ACD [5] allow for a shared set (SS) when one or more LUTs in terms of the BS variables are single-variable functions (buffers). The larger the SS size, the fewer LUTs are required. For instance, Figure 1 shows an ACD of a function with BS, FS, and SS resulting in three -input LUTs. In [5], maximizing the SS is implemented using binary decision diagrams (BDDs) [8]. More recently, truth-table-based implementations eliminated the need for explicitly constructing a BDD, resulting in a faster decomposition [9, 10].
ACD has been applied to map into fixed lookup table (LUT) structures [10] as a way to mitigate structural bias and improve the quality of standard LUT mapping. This approach utilizes heuristic variable re-ordering to find an ACD, supporting up to SS variable. Additionally, ACD has been used in post-mapping resynthesis [9], when logic cones composed of several LUTs are collapsed into single-output Boolean functions and re-expressed using fewer LUTs. The authors proposed to use disjoint-support decomposition (DSD) and Shannon’s expansion to pack logic into LUTs while supporting up to SS variables.
Since ACD is often applied only to functions up to or inputs (for LUT structures composed of two or three -LUTs, respectively), state-of-the-art LUT mapping is performed through local substitutions applied to an initial graph representation, called the subject graph. Generally, delay-optimal mapping w.r.t. the subject graph is feasible in polynomial time [11], while area-optimal mapping is NP-hard [12]. However, the structure of the subject graph strongly impacts the result. This phenomenon is known as structural bias. To mitigate structural bias, methods in the literature generate a set of structural choices (or decompositions) available during mapping [13, 14, 15].
This paper offers two main contributions. First, we revisit the formulation of ACD with SS to enhance its computational efficiency in LUT mappers and post-mapping resynthesis engines performing delay optimization. Based on the ideas presented in [16], our algorithm is truth-table-based and flexible in the number of FS, BS, and SS variables, and in the number of BS functions. Our ACD runs up to x faster, compared to [10], and up to x faster, compared to [9] when performing decompositions into two -LUTs. Furthermore, it also finds considerably more solutions.
Second, we use ACD for the delay optimization of LUT networks. The idea is to compute functional decompositions using the timing-critical variables in the FS and the rest of the variables in the BS and SS. We integrate our ACD into the state-of-the-art LUT mapper for delay optimization. To our knowledge, this is the first practical and scalable work that uses ACD for delay-driven LUT mapping.
We experimentally evaluate the performance of ACD and compare mapping based on Boolean decomposition against state-of-the-art methods:
1. We compare our ACD method against other decomposition methods in ABC, showing better quality with a competitive or better run time.
2. We demonstrate that mapping with ACD can efficiently mitigate structural bias and considerably reduce the delay. We compare the default LUT mapper in ABC, the LUT mapper with Boolean decomposition in ABC, and the proposed mapper with integrated ACD. We show that mapping with ACD outperforms the other mappers in delay by on average with and without structural choices [15]. Moreover, we show that an additional mapping round using the network obtained by ACD as a structural choice can further improve the delay, compared to the standard LUT mapper, by % with an area reduction of %.
3. We present new best results in the EPFL competition.
II Preliminaries
This section introduces the basic notations and background related to logic networks, decomposition, and LUT mapping.
II-A Definitions
A Boolean function is a mapping from a -dimensional Boolean space into a -dimensional one: .
A truth table representation of a -input Boolean function can be encoded as a bit string , i.e., a sequence of bits, of length . A bit at position is equal to the value taken by under the input assignment where
The positive cofactor of a Boolean function with respect to a variable , represented as , is the Boolean function obtained by setting . Similarly, the negative cofactor is the Boolean function obtained by setting .
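The bit-string view of truth tables and cofactors can be illustrated with a minimal Python sketch. It assumes a truth table stored as an integer whose bit at position i holds the function value under the assignment encoded by i, with variable 0 as the least significant one; the function name `cofactor` and this indexing convention are illustrative choices, not part of the original formulation.

```python
def cofactor(tt: int, num_vars: int, var: int, value: int) -> int:
    """Cofactor of `tt` with variable `var` fixed to `value`.

    The result is returned as a truth table over the same `num_vars`
    variables; it no longer depends on `var`.
    """
    result = 0
    for i in range(1 << num_vars):
        # Force bit `var` of the input assignment to the requested value.
        j = (i | (1 << var)) if value else (i & ~(1 << var))
        if (tt >> j) & 1:
            result |= 1 << i
    return result


# f = x0 AND x1 over two variables: only assignment 0b11 evaluates to 1.
f = 0b1000
positive = cofactor(f, 2, 0, 1)  # equals x1
negative = cofactor(f, 2, 0, 0)  # constant 0
```

Recombining the two cofactors with the truth table of x0 (0b1010 under this convention) gives back f, which is Shannon's expansion on a single variable.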
In the classical representation, we refer to the leftmost input column of a truth table as the most significant variable ( and the rightmost input column as the least significant variable (. A swap of two variables results in the interchange of the corresponding two-variable cofactors, thereby altering the truth table.
Figure 2 depicts two truth tables represented as bit strings, one in binary and one in hexadecimal. Notably, the rightmost truth table can be derived from the leftmost one by swapping the variables and . Marked next to both truth tables are the cofactors with respect to the two most significant variables.
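A variable swap can be sketched directly on the integer encoding of a truth table. The snippet below is a straightforward bit-by-bit version under the assumption that variable 0 is the least significant one; practical implementations typically use word-level masks and shifts instead, and the name `swap_variables` is hypothetical.

```python
def swap_variables(tt: int, num_vars: int, a: int, b: int) -> int:
    """Return the truth table obtained by exchanging variables `a` and `b`."""
    result = 0
    for i in range(1 << num_vars):
        j = i
        # If the two variables differ in assignment i, flip both bits so
        # that j is the assignment with their values exchanged.
        if ((i >> a) & 1) != ((i >> b) & 1):
            j = i ^ ((1 << a) | (1 << b))
        if (tt >> j) & 1:
            result |= 1 << i
    return result


# The projection x0 over two variables becomes x1 after the swap.
swapped = swap_variables(0b1010, 2, 0, 1)
```

Applying the same swap twice returns the original truth table, which is a convenient sanity check.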
A completely specified Boolean function essentially depends on a variable if there exists an input combination such that the value of the function changes when the variable is toggled (). The support of is the set of all variables on which function essentially depends. The supports of two functions are disjoint if they do not contain common variables. A set of functions is disjoint if their supports are pair-wise disjoint.
A Boolean network is modeled as a directed acyclic graph (DAG) with nodes represented by Boolean functions. The sources of the graph are the primary inputs (PIs), the sinks are the primary outputs (POs). For any node , the fanins of is a set of nodes driving , i.e. nodes that have an outgoing edge towards . Similarly, the fanouts of is a set of nodes driven by node , i.e., nodes that have an incoming edge from . A -LUT network is a Boolean network composed of -input lookup tables (-LUTs) capable of realizing any -input Boolean function. An and-inverter graph (AIG) [17] is a Boolean network where nodes are -input ANDs and edges may implement inverters.
A cut of a Boolean network is a pair (, ), where is a node called root, and is a set of nodes, called leaves, such that 1) every path from any PI to node passes through at least one leaf and 2) for each leaf , there is at least one path from a PI to passing through and not through another leaf. The size of a cut is the number of leaves. A cut is -feasible if its size does not exceed .
II-B Ashenhurst-Curtis decomposition
Ashenhurst-Curtis decomposition (ACD) [1, 2, 3] of a single-output Boolean function $f(X)$ can be expressed as follows:
$$f(X) = g\big(h(X_{bs}, X_{ss}),\, X_{ss},\, X_{fs}\big) \qquad (1)$$
where $X_{bs}$ is the bound set (BS), $X_{ss}$ is the shared set (SS), and $X_{fs}$ is the free set (FS). These sets are disjoint variable subsets, which together form the support of $f$. The function $h$ may be multi-output with the number of outputs less than the BS size. The single-output functions in $h$ are referred to as BS functions. The function $g$ is referred to as the composition function. When decomposing into -LUTs, the composition function is typically chosen to fit into one -input LUT. Figure 1 shows an ACD of an -input function into three -input LUTs with a -variable BS, a -variable SS, and a -variable FS. The decomposition generates two BS functions (, ) and a composition function () with inputs.
II-C FPGA technology mapping
LUT mapping is the process of expressing a Boolean network in terms of -input lookup tables (-LUTs). Before mapping, the network is represented as a k-bounded Boolean network called the subject graph, which contains nodes with a maximum fanin size of k. The AIG is the most common subject graph representation. The subject graph is transformed into a mapped network by applying local substitutions to sections of the circuit defined by cuts, which are computed using cut enumeration [18]. A LUT mapper computes a mapping solution by selecting a subset of the cuts that cover the subject graph while minimizing a cost function. The state-of-the-art LUT mapper computes cuts and refines the mapping solution in several mapping passes using heuristics based on delay, area, and edge count. For further details, refer to [19].
III Improvements to ACD
This section discusses a fast and versatile truth-table-based implementation of ACD for single-output functions with support for a shared set. We propose several novelties that make ACD practical within LUT mappers and resynthesis methods. Figure 3 illustrates the ACD computation. The BS, SS, FS, and the number of BS functions used are flexible and determined during the decomposition. The composition function () is implemented as a multiplexer of cofactors with respect to BS functions and the shared set. Functions dependent on the FS (, called FS functions, are the data inputs of the multiplexer found inside the composition function. BS functions and the shared set are instead the selection inputs.
This definition of decomposition reflects the one used by previous approaches [5]. Specifically, the decomposition is generic, i.e., it includes other types of decomposition. For instance, a Shannon’s expansion:
where is a selector of a multiplexer, can be re-expressed in our ACD format:
where is an FS variable, and are BS functions, and FS functions are , , , and .
In this section, we first present how to efficiently check the existence of a feasible ACD and assign variables to the FS, BS, and SS (Section III-A). Next, we show how to compute the decomposition while minimizing the number of BS functions and their support (Section III-B).
III-A Finding a feasible decomposition
After defining the properties of ACD, in this section we present an efficient method to check the existence of a Boolean decomposition and find an assignment of support variables to the FS and the BS (and SS). In particular, we focus on decomposition into two levels of -input LUTs. For simplicity, in this section we consider SS variables a part of the BS.
The first step to derive a decomposition is to partition the variables into FS and BS. Given a truth table, our approach enumerates different free sets. Let be the number of variables in the support of a function to decompose. Let be the number of variables to consider in the FS. The remaining variables are considered in the BS. The number of different free sets is . Regarding the choice of value when searching for a feasible two-level decomposition, for an -input function and -input LUTs, it is convenient to consider () variables in the FS, so that at most variables are considered in the BS. For instance, when and , we can choose and evaluate different -variable free sets.
For each FS, the truth table is transformed to have the FS variables as the least significant ones, compared to the BS variables. The variable reordering is performed using a dedicated procedure, which swaps two variables. Note that when enumerating all the free sets, the first FS, composed of the P least significant variables in the support of the function, does not need variable swapping since the original truth table already reflects this order. Then, every consecutive FS can be derived from a previous FS by swapping one variable in with one in . The complexity to explore all the FS is of swap operations. Figure 2 shows how a variable swap affects the truth table.
Each input assignment to the BS variables selects one -input function in terms of the FS variables. Specifically, each -input function is a cofactor with respect to . From a truth table in this format, FS functions are easily computed by extracting groups of bits at offsets with . Informally, FS functions are listed next to each other. Figure 2 graphically depicts the extraction of cofactors with respect to the two most significant variables.
Example 1: Let us consider the -variable function represented in hexadecimal format as a truth table 0x8804800184148111. Let us assume that the FS variables are the two least significant variables and the BS variables are the four most significant ones. The functions in terms of FS variables have truth tables with bits. There are of them, corresponding to hexadecimal digits in the truth table (0x8, 0x8, 0x0, 0x4, etc).
The target function can be realized using bound set functions if the number of unique FS functions, known as column multiplicity , does not exceed , hence . If , the composition function can be implemented as a -LUT.
Example 2: Continuing Example 1, there are FS functions of which only are unique. The FS functions are 0x8, 0x0, 0x4, and 0x1. Hence, the column multiplicity , which needs at least BS functions. Hence, this partition of variables into FS and BS produces a feasible support-reducing decomposition into -input LUTs. Using Figure 3, ACD assigns FS functions to . Then, two BS functions of at most inputs are necessary to select the correct FS function.
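Examples 1 and 2 can be reproduced with a few lines of Python. This is a minimal sketch that assumes the truth table is stored as an integer with the FS variables in the least significant positions; the helper names `fs_functions` and `column_multiplicity` are illustrative.

```python
def fs_functions(tt: int, num_vars: int, fs_size: int) -> list:
    """Extract the FS functions (cofactors w.r.t. the BS variables).

    The FS variables are assumed to be the `fs_size` least significant
    variables, so each FS function is a contiguous group of 2**fs_size bits.
    """
    width = 1 << fs_size
    mask = (1 << width) - 1
    count = 1 << (num_vars - fs_size)
    return [(tt >> (width * i)) & mask for i in range(count)]


def column_multiplicity(tt: int, num_vars: int, fs_size: int) -> int:
    """Number of distinct FS functions."""
    return len(set(fs_functions(tt, num_vars, fs_size)))


tt = 0x8804800184148111            # function of Example 1
columns = fs_functions(tt, 6, 2)   # one hexadecimal digit per BS assignment
mu = column_multiplicity(tt, 6, 2)
min_bs_functions = (mu - 1).bit_length()  # ceil(log2(mu)) for mu >= 1
print(sorted(set(columns)), mu, min_bs_functions)
```

With two FS variables, each FS function is exactly one hexadecimal digit of the truth table, which is why the unique functions can be read off directly from the hexadecimal string.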
We employ the enumeration of free sets while counting the number of unique cofactors to check if a support-reducing decomposition exists. In practice, a sufficient condition for a -level decomposition is to have and , i.e., the composition function is -feasible, and the number of remaining variables in the BS does not exceed .
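The enumeration of free sets can be sketched as follows. For brevity, this version indexes the truth table directly for each candidate free set instead of reordering it with variable swaps as described above; the two approaches compute the same column multiplicity. The function names and the tuple-based representation of a free set are assumptions made for illustration.

```python
from itertools import combinations


def multiplicity_for_free_set(tt: int, num_vars: int, free_set) -> int:
    """Column multiplicity of `tt` for the given free-set variables."""
    fs = list(free_set)
    bs = [v for v in range(num_vars) if v not in fs]
    columns = set()
    for b in range(1 << len(bs)):
        # Spread the BS assignment b onto the BS variable positions.
        base = 0
        for k, v in enumerate(bs):
            base |= ((b >> k) & 1) << v
        # Collect the FS function selected by this BS assignment.
        col = 0
        for a in range(1 << len(fs)):
            idx = base
            for k, v in enumerate(fs):
                idx |= ((a >> k) & 1) << v
            col |= ((tt >> idx) & 1) << a
        columns.add(col)
    return len(columns)


def best_free_set(tt: int, num_vars: int, fs_size: int):
    """Return (multiplicity, free_set) with the smallest multiplicity."""
    best = None
    for free_set in combinations(range(num_vars), fs_size):
        mu = multiplicity_for_free_set(tt, num_vars, free_set)
        if best is None or mu < best[0]:
            best = (mu, free_set)
    return best


tt = 0x8804800184148111
mu_lsb = multiplicity_for_free_set(tt, 6, (0, 1))
mu_best, fs_best = best_free_set(tt, 6, 2)
```

For a 6-variable function and 2-variable free sets, the loop visits 15 candidates, matching the binomial count of free sets.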
After identifying a partition of variables into FS and BS, and the corresponding unique FS functions, our method uses the techniques in Section III-B to produce a decomposition while minimizing the number of BS functions and their support.
III-B Functional encoding and support minimization
Once a partition of variables into FS and BS with a feasible decomposition is found, the BS functions are extracted by assigning each FS function to an encoding. Informally, an encoding represents the assignment of FS functions to the data inputs of the MUX of Figure 3 (e.g., the encoding of is ). While any encoding that distinguishes FS functions is a valid solution, a good encoding also minimizes the number of BS functions required (by maximizing the shared set), and the functional support. In particular, it is crucial to find an encoding that minimizes the support for three reasons. First, if , by minimizing the support, each BS function would ideally fit into a -LUT, and the decomposition is feasible in levels. Second, minimizing the support maximizes the shared set (buffer BS functions), reducing the number of required LUTs. Third, the number of edges required is reduced, helping routability. Finding a feasible encoding is similar to solving constrained encoding problems [20, 21, 22].
An encoding is an assignment of a code of length to each FS function. A variable takes one of the three values, , , or , indicating the ON-set, OFF-set, and DC-set, respectively. Let i-sets be the set of Boolean functions in terms of the BS variables encoding FS functions using one-hot encoding. Precisely, an i-set represents one FS function and takes value when an input assignment to the BS variables results in the corresponding FS function.
Example 3: Using Example 2, the i-set corresponding to the FS function 0x8 is 1100100010001000 in binary format. Note that the truth table has variables and has value when the original function is 0x8.
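The i-sets of Example 3 can be computed with the short sketch below. It assumes the FS variables are the least significant ones and represents each i-set as an integer truth table over the BS variables; the function name `i_sets` and the dictionary layout are illustrative.

```python
def i_sets(tt: int, num_vars: int, fs_size: int) -> dict:
    """Map each unique FS function to its i-set.

    Bit b of an i-set is 1 when BS assignment b selects that FS function.
    """
    width = 1 << fs_size
    mask = (1 << width) - 1
    sets = {}
    for b in range(1 << (num_vars - fs_size)):
        col = (tt >> (width * b)) & mask
        sets[col] = sets.get(col, 0) | (1 << b)
    return sets


isets = i_sets(0x8804800184148111, 6, 2)
print(format(isets[0x8], "016b"))
```

By construction, the i-sets are pairwise disjoint and their union covers every BS assignment, so they form a one-hot encoding of the FS functions.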
I-sets are used to derive a more compact encoding with a two-step procedure. The first one enumerates candidate BS functions. The second one solves a unate covering problem in which columns are candidate BS functions and rows are pairs of FS functions to be distinguished.
Candidate BS functions are functions depending on BS variables whose output can be used to encode FS functions. They are enumerated by combining i-sets. To leverage all the functional degrees of freedom of a strict encoding, i-sets in a BS candidate can be either in the ON-set, OFF-set, or don’t-care (DC) set. Since candidate BSs are used as select inputs of a multiplexer, BS candidates can distinguish elements in the ON-set (takes value ) against elements in the OFF-set (takes value ). In encoding problems, BS functions are called dichotomies, while the pairs of functions to be distinguished are referred to as seed dichotomies [22]. Don’t-cares in BS candidates are also important to minimize the support, which translates into fewer LUT edges.
Example 4: Continuing Example 3, let us consider the candidate bound set function that has the i-sets {0x8, 0x1} in the ON-set and the i-set {0x4} in the OFF-set. Its function in binary format is 11-01--110101111 where “-” is a don’t care. When , either 0x8 or 0x1 is selected. When , 0x4 is selected. The corresponding dichotomy is {{0x8, 0x1},{0x4}}. In this case, function distinguishes 0x8 from 0x4 and 0x1 from 0x4, covering the two seed dichotomies {{0x8},{0x4}} (or {{0x4},{0x8}}) and {{0x1},{0x4}} (or {{0x4},{0x1}}).
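The candidate of Example 4 can be rebuilt from the i-sets with the following sketch. The i-set values are the ones obtained for the running example; representing a candidate as a pair of ON and OFF masks, and the helper names, are assumptions made for illustration.

```python
# i-sets of the running example, keyed by FS function (truth tables over
# the four BS variables).
ISETS = {0x8: 0xC888, 0x0: 0x2600, 0x4: 0x1050, 0x1: 0x0127}


def candidate(isets: dict, on, off):
    """Build a BS candidate as (on_mask, off_mask); the rest is don't care."""
    on_mask = 0
    off_mask = 0
    for f in on:
        on_mask |= isets[f]
    for f in off:
        off_mask |= isets[f]
    return on_mask, off_mask


def to_string(on_mask: int, off_mask: int, num_bits: int) -> str:
    """Render the candidate most significant bit first, with '-' for DC."""
    chars = []
    for b in reversed(range(num_bits)):
        if (on_mask >> b) & 1:
            chars.append("1")
        elif (off_mask >> b) & 1:
            chars.append("0")
        else:
            chars.append("-")
    return "".join(chars)


def covered_seed_dichotomies(on, off) -> set:
    """Pairs of FS functions that the candidate tells apart."""
    return {frozenset((a, b)) for a in on for b in off}


on_mask, off_mask = candidate(ISETS, on=[0x8, 0x1], off=[0x4])
text = to_string(on_mask, off_mask, 16)
pairs = covered_seed_dichotomies([0x8, 0x1], [0x4])
```

The FS function 0x0 is left in the DC-set, which produces the three don't-care positions in the binary string.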
A candidate bound set function is generated by assigning each i-set to be in the ON-set, OFF-set, or DC-set. Hence, the total number of possible BS candidates is . Nonetheless, some BS candidates are interchangeable, i.e., one candidate can be obtained by swapping the ON-set and the OFF-set of another BS candidate. Our enumeration removes these symmetries by fixing one i-set to be only in the ON-set or DC-set, enumerating only BS candidates. Moreover, candidates not distinguishing any pair of FS functions are removed. As a special case, if is a power of , the number of possible BS candidates reduces to by splitting the FS functions to be equally distributed between ON-set and OFF-set, i.e., each BS candidate must distinguish half of the FS functions against the other half.
One limitation of this method is that the number of BS candidates is exponentially dependent on the column multiplicity. However, we may further reduce the number of BS candidates when it is too large. In particular, for an ACD into -LUTs the maximum column multiplicity to support is . Consequently, the highest number of BS candidates is million for . To maintain a reasonable number of BS candidates, our method does not use don’t cares for problems with , enumerating candidates and reducing the highest number of candidates to thousand. Through experimentation, we have observed that imposing this limitation scarcely affects the quality of the encoding, while substantially enhancing run-time efficiency. Conversely, extending this method to lower multiplicity values noticeably compromises the solution quality.
Each BS candidate function is associated with a cost that depends on the number of variables in its support. The number of variables is computed with a special procedure that considers don’t cares. Then, a covering table is constructed by having all the pairs of FS functions to be distinguished (seed dichotomies) as rows and the BS candidates as columns. A row-column entry is if the BS candidate of column distinguishes the seed dichotomy . A solution that minimizes the support is computed by solving a minimum-cost covering problem [22]. The solution must cover all the rows while minimizing the cost. We use greedy covering followed by local search to compute a cost-minimizing cover. A single iteration of greedy covering extracts one column covering the most non-covered rows while minimizing the cost. The process is iterated until a solution is found. Then, the solution is iteratively improved by replacing one column with another having a lower cost.
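The greedy covering step can be sketched on the running example. This version makes several simplifying assumptions that are not prescribed by the text: it only enumerates balanced candidates without don't cares (the power-of-two special case), it fixes the FS function 0x1 in the ON-set to remove ON/OFF symmetries, it uses the plain support size as the cost, and it omits the local-search refinement. All names are illustrative.

```python
from itertools import combinations

# i-sets of the running example; the first key is the one fixed in the ON-set.
ISETS = {0x1: 0x0127, 0x4: 0x1050, 0x0: 0x2600, 0x8: 0xC888}
NUM_BS_VARS = 4


def support_size(tt: int, num_vars: int) -> int:
    """Number of variables the completely specified function depends on."""
    return sum(
        any(((tt >> i) & 1) != ((tt >> (i ^ (1 << v))) & 1)
            for i in range(1 << num_vars))
        for v in range(num_vars)
    )


def balanced_candidates(isets: dict) -> list:
    """Candidates splitting the FS functions evenly between ON and OFF."""
    funcs = list(isets)
    anchor = funcs[0]
    cands = []
    for on in combinations(funcs, len(funcs) // 2):
        if anchor not in on:
            continue  # symmetric to a candidate that has the anchor ON
        off = [f for f in funcs if f not in on]
        tt = 0
        for f in on:
            tt |= isets[f]
        rows = {frozenset((a, b)) for a in on for b in off}
        cands.append({"tt": tt, "cost": support_size(tt, NUM_BS_VARS),
                      "rows": rows})
    return cands


def greedy_cover(cands: list, rows: set) -> list:
    """Repeatedly pick the column covering most uncovered rows, then cheapest."""
    uncovered = set(rows)
    chosen = []
    while uncovered:
        best = max(cands,
                   key=lambda c: (len(c["rows"] & uncovered), -c["cost"]))
        if not best["rows"] & uncovered:
            raise ValueError("remaining rows cannot be covered")
        chosen.append(best)
        uncovered -= best["rows"]
    return chosen


seed_dichotomies = {frozenset(p) for p in combinations(ISETS, 2)}
solution = greedy_cover(balanced_candidates(ISETS), seed_dichotomies)
print([hex(c["tt"]) for c in solution], sum(c["cost"] for c in solution))
```

Under these assumptions, the greedy pass selects two candidates that together distinguish all six pairs of FS functions.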
Example 5: Figure 4 shows a covering table reflecting the examples in this section. Each column in the table is a candidate BS function shown as a truth table in hexadecimal format on variables. Each BS candidate has a cost based on the number of variables in its support. Each row is a seed dichotomy. An element in the table is if the BSj distinguishes the seed dichotomy . The best solution with cost takes the second and third columns and results in two BS functions depending on variables.
Given a solution, an encoding of the FS functions is obtained by assigning a code , in which each variable corresponds to a selected BSi candidate.
Example 6: Continuing Example 5, a minimum cover involves BS 0x1177, by taking 0x4 and 0x1 in the ON-set, and BS 0x2727 by taking 0x0 and 0x1 in the ON-set. Given the BS functions, the encoding of the FS functions assigns the following codes to in Figure 3: 00, 01, 10, and 11. Finally, the composition function is computed using the FS and its encoding, resulting in function 0x1048 when represented in hexadecimal format. Consequently, the function has been successfully decomposed using three -LUTs.
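The result of Example 6 can be checked by recomposing the function from the two BS functions and the composition function. In this sketch, the ordering of the composition inputs (0x2727 as the most significant selector, 0x1177 as the next one, followed by the two FS variables) is an assumption chosen so that 0x1048 reproduces the original truth table; the function name `compose` is illustrative.

```python
def compose(bs_low: int, bs_high: int, g: int,
            num_bs_vars: int, fs_size: int) -> int:
    """Rebuild f from two BS functions and the composition function g.

    For every BS assignment, the two BS outputs form a 2-bit code that
    selects one FS function stored in g.
    """
    width = 1 << fs_size
    mask = (1 << width) - 1
    tt = 0
    for b in range(1 << num_bs_vars):
        code = (((bs_high >> b) & 1) << 1) | ((bs_low >> b) & 1)
        fs_func = (g >> (width * code)) & mask
        tt |= fs_func << (width * b)
    return tt


rebuilt = compose(0x1177, 0x2727, 0x1048, num_bs_vars=4, fs_size=2)
print(hex(rebuilt))
```

Each hexadecimal digit of 0x1048 is one FS function, so the composition function acts as a 4-to-1 multiplexer over the FS functions 0x8, 0x4, 0x0, and 0x1.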
IV Technology mapping with ACD
In this section, we leverage the Ashenhurst-Curtis decomposition (ACD) methods described in Section III to improve the delay of LUT networks. ACD can be used in two ways: 1) as part of LUT mapping or 2) as a post-mapping resynthesis method to compact logic and decrease the delay. In this work, we focus on the former usage since it has more flexibility and optimization opportunities. Although post-mapping resynthesis is not covered in this work, its implementation would follow a methodology similar to [9]. First, this section discusses how to perform delay-oriented functional decomposition for any number of FS variables and BS functions. Then, it describes the integration of ACD in a technology mapper.
IV-A Delay-oriented ACD
Let us consider a node in a -LUT network and a cut rooted in that contains leaves in the input sub-network of . Among all the leaves, some are timing-critical and some are not. Let be the latest arrival delay of a leaf in . We use ACD to find an implementation that realizes the function of cut with delay where , assuming a unit-delay model. Specifically, we use the timing-critical leaves of in the FS and other non-critical ones in the BS or SS. This transformation may reduce the worst delay of a LUT network when applied on the critical path.
The ACD-based transformation is performed in two steps. First, our method verifies the existence of a delay-minimizing decomposition. Second, if a decomposition exists, it solves the encoding problem and returns a solution.
IV-A1 Checking the existence of a decomposition
Algorithm 1 shows the procedure evaluate to check the existence of an ACD. The algorithm receives the function represented as a truth table of a large cut with size where . Set contains a list of timing-critical variables with delay . First, the truth table is transformed to have critical variables as the least significant ones since they must be in the FS (at line 1). The proposed approach limits to ensure a two-level decomposition without solving the encoding problem. Hence, the number of variables in the FS must be at least , and to include all the delay-critical variables (at line 1). For each FS of variables, the column multiplicity value is computed using the method described in Section III-A, and the smallest one is returned (at line 1). In this case, since delay-critical variables are always part of the FS, different combinations are enumerated. If the smallest multiplicity found can be implemented using at most BS functions, a delay-minimizing ACD exists. In this case, variables in the FS have the delay increase of while other variables have the delay increase of (at line 1). If, on the other hand, a decomposition with does not exist, the function is not decomposable.
The loop in line 1 begins checking the existence of a decomposition with a smaller value of . This approach is based on the theoretical property that if a function is not decomposable for the given value of , it is also not decomposable for . Then, if a decomposition exists, the loop attempts to increase the number of variables in the free set. Specifically, maximizing the free set to include non-critical variables has multiple benefits. Primarily, the decomposition would have a reduced column multiplicity, which simplifies the encoding problem. Additionally, maximizing the free set may increase the required time of the associated non-critical signals, facilitating the area-recovery process of technology mapping.
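The existence check can be approximated by the sketch below. It is not Algorithm 1 itself: the feasibility condition used here (the FS variables plus the BS function outputs must fit into one k-input LUT, and at most k variables may remain in the BS) is an assumption derived from the sufficient condition of Section III-A, delays are not modeled, and all names are hypothetical.

```python
from itertools import combinations


def multiplicity(tt: int, num_vars: int, free_set) -> int:
    """Column multiplicity of `tt` for the given free-set variables."""
    fs = list(free_set)
    bs = [v for v in range(num_vars) if v not in fs]
    columns = set()
    for b in range(1 << len(bs)):
        col = 0
        for a in range(1 << len(fs)):
            idx = 0
            for k, v in enumerate(bs):
                idx |= ((b >> k) & 1) << v
            for k, v in enumerate(fs):
                idx |= ((a >> k) & 1) << v
            col |= ((tt >> idx) & 1) << a
        columns.add(col)
    return len(columns)


def evaluate(tt: int, num_vars: int, critical, k: int):
    """Return (multiplicity, free_set) of the largest feasible free set that
    contains all critical variables, or None if no decomposition is found."""
    others = [v for v in range(num_vars) if v not in critical]
    best = None
    p_min = max(len(critical), num_vars - k)  # at most k variables in the BS
    for p in range(p_min, k):                 # start from the smallest FS
        found = None
        for extra in combinations(others, p - len(critical)):
            free_set = tuple(sorted(set(critical) | set(extra)))
            mu = multiplicity(tt, num_vars, free_set)
            if found is None or mu < found[0]:
                found = (mu, free_set)
        if found is None:
            break
        num_bs_functions = (found[0] - 1).bit_length()
        if p + num_bs_functions > k:          # composition LUT would not fit
            break
        best = found                          # feasible: try a larger FS next
    return best


result = evaluate(0x8804800184148111, 6, critical=[0, 1], k=4)
```

The loop stops at the first free-set size that fails, reflecting the property that a function not decomposable for a given free-set size is also not decomposable for a larger one.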
IV-A2 Computing the decomposition
After applying evaluate, another procedure decompose is used to compute the actual decomposition using the methods described in Section III-B.
IV-B LUT mapping with ACD
The methods described in Section IV-A have been integrated into the LUT mapping algorithm in [19]. Each mapping iteration computes -feasible cuts rooted in nodes of the subject graph and selects one best cut for each node based on cost functions and slack. Typically, enumerated cuts are -feasible, i.e., any cut abstracts a -LUT. In our implementation, cut enumeration computes large cuts up to size , where is provided by the user. During cut enumeration, the mapper computes cut functions as truth tables. For the non--feasible computed cuts, the mapper uses Algorithm 1 to check the existence of a delay-minimizing decomposition into -LUTs. If a decomposition is not feasible, the cut is discarded. If a decomposition exists, the cut delay is computed using the propagation delay returned by Algorithm 1. The area is computed pessimistically, neglecting the existence of a shared set, i.e., . To have precise area information, i.e., the number of required LUTs, ACD would need to solve the encoding problem and compute the decomposition. However, experimentally, not running the decomposition on the fly reduces the run time considerably with negligible impact on the final circuit area.
The mapper uses -feasible cuts with ACD in the delay mapping pass, while it uses -feasible cuts in the following area-recovery iterations. Note that area recovery aims at improving the solution over non-critical paths and can always re-use the best cuts from the previous pass, such that the required times are met. After the last mapping pass, a cover is generated consisting of - and -feasible cuts. At this stage, the mapper decomposes the non--feasible cuts into -LUTs.
V Experiments
This section presents an experimental evaluation of the proposed LUT mapping with ACD. First, the ACD algorithm proposed in this paper is compared with other state-of-the-art methods for decomposing practical functions. Then, we evaluate ACD for delay-driven LUT mapping. While the experiments are reported for -input LUTs, similar improvements have been obtained for -input LUTs as well.
The proposed methods have been implemented in ABC [23]. For our experiments, we use the EPFL combinational benchmark suite [24] containing several circuits provided as and-inverter graphs (AIGs). The baseline has been obtained using the commands and scripts “dfraig; resyn; resyn2; resyn2rs; if -y -K 6; resyn2rs” in ABC, which perform a high-effort size and depth AIG optimization. In particular, it combines SAT sweeping [25], scripts for delay-oriented AIG optimization [17], and lazy man’s logic synthesis [26], which is the most aggressive depth minimization command in ABC. The experiments have been conducted on an Intel i quad-core GHz on MacOS. The results have been verified using combinational equivalence checking in ABC. We extended the LUT mapper if in ABC to perform ACD as discussed in Section IV. The following commands are used in the experiments:
• dch (-f): computes structural choices used to mitigate the structural bias [15], where -f stands for “fast”;
• if -K 6: performs delay-oriented technology mapping with choices into -LUTs using -feasible cuts;
• if -s -S 66 -K 8: performs delay-oriented technology mapping using -feasible cuts and decomposes logic for minimal delay into two -LUTs using a SAT-based formulation (available in ABC but not published);
• if -Z 6 -K 8: performs technology mapping into -LUTs using the proposed implementation of delay-oriented ACD described in Section IV for -feasible cuts;
• st: derives an AIG from an LUT network.
V-A Decomposition success rate
| ACD type | 7 vars (41071) | | 8 vars (107466) | | 9 vars (195602) | | 10 vars (313649) | | 11 vars (404991) | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Success (%) | Time (s) | Success (%) | Time (s) | Success (%) | Time (s) | Success (%) | Time (s) | Success (%) | Time (s) |
| lutpack [9] | 98.34% | 20.39 | 83.47% | 64.37 | 69.92% | 154.38 | 48.95% | 334.79 | 26.87% | 897.55 |
| S66 [10] | 84.18% | 0.60 | 69.24% | 2.57 | 52.13% | 4.99 | 37.36% | 6.99 | 19.14% | 9.79 |
| 66 1-SS | 97.30% | 0.28 | 82.23% | 1.41 | 74.24% | 4.20 | 63.06% | 9.39 | 32.88% | 16.43 |
| 66 M-SS | 99.82% | 0.30 | 92.94% | 3.08 | 84.71% | 9.92 | 63.06% | 9.73 | 32.88% | 16.58 |
In this experiment, we evaluate the performance of ACD in decomposing functions by comparing it against other implementations in ABC. Specifically, we test the number of functions that can be successfully decomposed into two -LUTs and the run time needed. We run this experiment on practical functions, i.e., functions that are observable in designs and benchmarks, which include fully-, partially-, and non-DSD-decomposable functions. We extract practical functions from the EPFL benchmarks. Since the number of practical functions can be large, we classify them into NPN-equivalence classes employing the heuristic sifting algorithm [27, 28].
Table I shows the percentage of decomposable functions and the run time for different methods and support sizes. For instance, the first column contains results for decomposing practical 7-input functions, where 41071 indicates the number of unique NPN classes collected. Each row of the table shows one ACD method. The first method, lutpack [9], performs a heuristic ACD using DSD and Shannon’s expansion, supporting up to -SS variables. The second method, S66 [10], performs ACD using heuristic variable re-ordering supporting at most -SS variable. Finally, we present two variants of our decomposition method restricted to use -LUTs. One uses up to 1 SS variable (66 1-SS), the other (66 M-SS) has no restrictions on the number of SS variables. The approaches described in this paper outperform the state of the art in quality for a competitive or better run time.
V-B Decomposition success rate for delay optimization
| N late | ACD type | 7 vars | 8 vars | 9 vars | 10 vars | 11 vars |
|---|---|---|---|---|---|---|
| 0 | 66 M-SS | 99.82% | 92.94% | 84.71% | 63.06% | 32.88% |
| 0 | Generic | 100.00% | 100.00% | 98.05% | 90.20% | 32.88% |
| 1 | 66 M-SS | 96.59% | 79.60% | 61.51% | 37.35% | 16.54% |
| 1 | Generic | 100.00% | 100.00% | 97.57% | 83.23% | 16.54% |
| 2 | 66 M-SS | 86.22% | 59.78% | 39.28% | 23.74% | 10.95% |
| 2 | Generic | 100.00% | 100.00% | 94.19% | 66.56% | 10.95% |
| 3 | 66 M-SS | 65.11% | 36.37% | 21.25% | 13.78% | 6.96% |
| 3 | Generic | 93.78% | 86.03% | 76.82% | 44.51% | 6.96% |
| 4 | 66 M-SS | 36.96% | 17.00% | 8.62% | 7.21% | 4.43% |
| 4 | Generic | 54.55% | 40.42% | 25.45% | 23.70% | 4.43% |
| 5 | 66 M-SS | 14.52% | 5.42% | 2.96% | 2.84% | 2.61% |
| 5 | Generic | 14.52% | 5.42% | 2.96% | 2.84% | 2.61% |
We extend the previous experiment to evaluate delay minimization using the proposed ACD method. This experiment tests the success rate of the decomposition for practical functions given delay-critical variables, which are required to be in the free set. Informally, for delay-critical variables with delay , this experiment checks the existence of a decomposition with delay . We only consider 66 M-SS and generic ACD since other known methods do not perform delay minimization using the input arrival time. For each function, we randomly generate up to unique sets of delay-critical variables and test the decomposition for each one of them. While 66 M-SS is limited to two LUTs, generic can use up to LUTs.
Table II presents the success rate based on the number of delay-critical variables, shown in the column “N late”. The table highlights the advantage of supporting multiple BS functions. Generic ACD has a high success rate in most cases. Limitations occur when the number of delay-critical variables exceeds or the number of variables in the support is or more. Generally, the decomposition of -input variables is rare. However, many input variables are still decomposable.
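The constraint that delay-critical variables must stay in the free set can be illustrated with a small sketch. It is a simplification under stated assumptions, not the algorithm of this paper: there is no shared set, bound sets are enumerated exhaustively among the non-critical variables, and the number of bound-set LUTs is taken as the number of bits needed to encode the distinct cofactors. The function names and the integer truth-table encoding are hypothetical.

```python
from itertools import combinations

def distinct_cofactors(tt, num_vars, bound_set):
    """Return the set of cofactors of f, one per assignment of the bound-set variables."""
    free_set = [v for v in range(num_vars) if v not in bound_set]
    columns = {}
    for m in range(1 << num_vars):
        key = tuple((m >> v) & 1 for v in bound_set)
        fs_index = sum(((m >> v) & 1) << i for i, v in enumerate(free_set))
        columns[key] = columns.get(key, 0) | (((tt >> m) & 1) << fs_index)
    return set(columns.values())

def find_delay_aware_bound_set(tt, num_vars, late_vars, k=6):
    """Search for a bound set that avoids all late (delay-critical) variables.

    Late variables then feed only the composition LUT, so they traverse a
    single LUT level. Returns (bound_set, number_of_bound_set_luts) or None.
    """
    early = [v for v in range(num_vars) if v not in late_vars]
    for size in range(min(k, len(early)), 1, -1):    # prefer large bound sets
        for bound_set in combinations(early, size):
            mu = len(distinct_cofactors(tt, num_vars, bound_set))
            num_bs_luts = (mu - 1).bit_length()      # ceil(log2(mu)) encoding bits
            free_size = num_vars - size
            if free_size + num_bs_luts <= k:         # composition LUT fits in k inputs
                return bound_set, num_bs_luts
    return None

# 7-input majority as an integer truth table (bit m = f at minterm m).
maj7 = sum(1 << m for m in range(1 << 7) if bin(m).count("1") >= 4)

print(find_delay_aware_bound_set(maj7, 7, late_vars={6}))
print(find_delay_aware_bound_set(maj7, 7, late_vars={4, 5, 6}))
print(find_delay_aware_bound_set(maj7, 7, late_vars={2, 3, 4, 5, 6}))
```

In this toy setting, allowing several bound-set functions lets the 7-input majority be decomposed with one or three late variables, whereas five late variables leave too few candidates for the bound set. This mirrors the trend of Table II, where the success rate drops as the number of delay-critical variables grows.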
V-C Delay-driven LUT mapping
Benchmark | ABC: dch; if -K 6 | | | | ABC: dch; if -s -S 66 -K 8 | | | | ACD | | | | ACD; st; dch -f; if -K 6 | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | LUTs | Edges | Depth | Time (s) | LUTs | Edges | Depth | Time (s) | LUTs | Edges | Depth | Time (s) | LUTs | Edges | Depth | Time (s) |
adder | 363 | 1433 | 22 | 0.18 | 362 | 1465 | 20 | 0.28 | 383 | 1519 | 16 | 0.20 | 353 | 1518 | 10 | 0.39 |
bar | 1664 | 9344 | 4 | 0.44 | 1664 | 9344 | 4 | 0.57 | 1664 | 9344 | 4 | 0.47 | 1006 | 5274 | 4 | 0.76 |
div | 8618 | 32394 | 406 | 6.62 | 9107 | 33665 | 397 | 13.42 | 11644 | 44496 | 326 | 7.16 | 9068 | 39167 | 271 | 21.19 |
hyp | 58393 | 239097 | 1864 | 5.43 | 61701 | 247699 | 1840 | 31.82 | 65615 | 264998 | 1396 | 11.13 | 61769 | 263254 | 1034 | 19.76 |
log2 | 9712 | 43562 | 58 | 17.05 | 10172 | 44943 | 58 | 30.06 | 10313 | 46365 | 56 | 17.81 | 9429 | 42533 | 57 | 39.09 |
max | 831 | 3804 | 14 | 0.37 | 840 | 3668 | 14 | 0.63 | 1211 | 5578 | 12 | 0.42 | 871 | 4277 | 11 | 1.39 |
multiplier | 7383 | 34137 | 36 | 6.01 | 7334 | 32781 | 36 | 12.11 | 7693 | 35798 | 33 | 6.82 | 6800 | 31705 | 31 | 13.32 |
sin | 1928 | 8445 | 30 | 1.31 | 1948 | 8463 | 30 | 4.94 | 2052 | 8913 | 29 | 1.50 | 1830 | 8178 | 30 | 2.91 |
sqrt | 7515 | 29573 | 663 | 4.17 | 7972 | 30610 | 638 | 12.66 | 10156 | 38558 | 519 | 4.73 | 9292 | 36030 | 476 | 8.77 |
square | 4122 | 17319 | 23 | 1.98 | 4165 | 17547 | 22 | 3.91 | 4107 | 17924 | 18 | 2.22 | 4118 | 18285 | 14 | 5.15 |
arbiter | 1833 | 8982 | 6 | 1.64 | 1879 | 8836 | 6 | 2.02 | 1850 | 8987 | 6 | 1.70 | 2037 | 8780 | 6 | 3.33 |
cavlc | 137 | 707 | 4 | 0.13 | 104 | 491 | 4 | 0.56 | 137 | 707 | 4 | 0.15 | 123 | 655 | 4 | 0.20 |
ctrl | 30 | 133 | 2 | 0.07 | 28 | 127 | 2 | 0.08 | 30 | 133 | 2 | 0.08 | 29 | 126 | 2 | 0.08 |
dec | 287 | 684 | 2 | 0.09 | 287 | 1404 | 2 | 0.10 | 287 | 684 | 2 | 0.10 | 284 | 816 | 2 | 0.12 |
i2c | 312 | 1360 | 3 | 0.16 | 306 | 1316 | 3 | 0.36 | 319 | 1378 | 3 | 0.19 | 297 | 1329 | 3 | 0.27 |
int2float | 52 | 258 | 3 | 0.08 | 46 | 205 | 3 | 0.18 | 52 | 258 | 3 | 0.09 | 50 | 251 | 3 | 0.11 |
mem_ctrl | 11037 | 48812 | 18 | 10.24 | 10830 | 46368 | 18 | 31.67 | 11232 | 49483 | 17 | 11.40 | 10398 | 45793 | 16 | 20.57 |
priority | 178 | 725 | 6 | 0.11 | 182 | 736 | 6 | 0.18 | 185 | 736 | 6 | 0.12 | 171 | 698 | 6 | 0.17 |
router | 89 | 285 | 4 | 0.09 | 61 | 283 | 4 | 0.14 | 92 | 290 | 4 | 0.09 | 89 | 279 | 4 | 0.12 |
voter | 1838 | 8596 | 13 | 2.23 | 1784 | 8624 | 13 | 4.14 | 1838 | 8583 | 13 | 2.32 | 1777 | 8426 | 13 | 4.82 |
Improvement | | | | | 2.57% | -2.57% | 1.04% | | -8.13% | -7.87% | 7.52% | | 2.20% | -0.30% | 12.39% | |
Total | | | | 58.40 | | | | 149.83 | | | | 68.70 | | | | 142.52 |
Table III compares four technology mapping strategies for delay minimization during mapping into 6-LUTs, assuming a unit-delay model. Each strategy takes the baseline as input and computes structural choices before mapping. Structural choices have not been used for the benchmark hyp due to a known bug in ABC. The proposed method is compared against standard LUT mapping and mapping into LUT structures. Command ACD denotes our mapper with Boolean decomposition using the sequence “dch; if -Z 6 -K 8”. We do not compare against [10] and [9] because those methods do not support delay minimization. Furthermore, we do not compare against the recent mapper with gate decomposition based on bin-packing [29]. Nevertheless, the mapper in [29] would improve the average delay of ABC's if command by only %.
Mapping into the LUT structure “66”, composed of two 6-LUTs, which is a SAT-based version of structural ACD, reduces depth by 1.04% and area by 2.57% on average, at the cost of increasing the number of edges by 2.57%. The proposed LUT mapping with ACD improves the depth of the LUT network by 7.52% on average while increasing the number of LUTs and edges by 8.13% and 7.87%, respectively.
Note that most of the improvement is concentrated in the first benchmarks since others are already close to their best known depth [30]. For of them, the delay reduction exceeds % and is up to %. In practice, part of the area increase can be recovered by area-recovery methods [9, 31, 32], by delay relaxation, or by an additional mapping step applied after ACD. The rightmost strategy implements the last option. The LUT count and edge count are reduced considerably, leading to an area improvement of 2.20% compared to traditional technology mapping with choices. The logic depth also decreases further, by 12.39% on average. Specifically, the result after ACD is used as a choice to improve the next round of technology mapping because choices extracted from mapping with ACD are structurally better suited to delay-oriented mapping than the original AIG. Moreover, structural choices help reduce the area over the non-critical paths. Note that a second mapping round does not provide practical benefits if applied to the default LUT mapper (leftmost column) since the network after deriving the AIG is structurally similar to the baseline. Furthermore, benchmark hyp is noticeably improved by remapping, both in area and delay, without using structural choices. Regarding run time, mapping with ACD is faster than mapping into LUT structures while being more general.
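The depth entries in the “Improvement” row of Table III are consistent with an arithmetic mean of the per-benchmark relative improvements over the first strategy. The short sketch below recomputes them from the Depth columns of the table. The averaging rule is our reading of the table, and the function name `average_improvement` is illustrative.

```python
# Depth columns of Table III, in benchmark order (adder ... voter).
baseline  = [22, 4, 406, 1864, 58, 14, 36, 30, 663, 23, 6, 4, 2, 2, 3, 3, 18, 6, 4, 13]
lut_s66   = [20, 4, 397, 1840, 58, 14, 36, 30, 638, 22, 6, 4, 2, 2, 3, 3, 18, 6, 4, 13]
acd       = [16, 4, 326, 1396, 56, 12, 33, 29, 519, 18, 6, 4, 2, 2, 3, 3, 17, 6, 4, 13]
acd_remap = [10, 4, 271, 1034, 57, 11, 31, 30, 476, 14, 6, 4, 2, 2, 3, 3, 16, 6, 4, 13]

def average_improvement(reference, result):
    """Mean of per-benchmark relative improvements, in percent."""
    ratios = [(ref - res) / ref for ref, res in zip(reference, result)]
    return 100.0 * sum(ratios) / len(ratios)

for name, depths in [("S66", lut_s66), ("ACD", acd), ("ACD + remap", acd_remap)]:
    print(f"{name}: {average_improvement(baseline, depths):.2f}%")
# S66: 1.04%
# ACD: 7.52%
# ACD + remap: 12.39%
```

Because each benchmark contributes equally to this mean, small designs weigh as much as large ones, and the ten control benchmarks with unchanged depth pull the average down.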
V-D EPFL synthesis competition
Benchmark | Best [30] | | dch -f; if -K 6 | | dch -f; if -Z 6 -K 10 | |
---|---|---|---|---|---|---|
 | LUTs | Depth | LUTs | Depth | LUTs | Depth |
adder | 347 | 5 | 360 | 6 | 445 | 5 |
bar | 512 | 4 | 512 | 4 | 512 | 4 |
div | 25318 | 175 | 23461 | 192 | 31526 | 175 |
hyp | 182723 | 483 | 122394 | 511 | 154903 | 473 |
log2 | 8617 | 52 | 8778 | 60 | 9613 | 51 |
max | 1114 | 6 | 1113 | 7 | 1250 | 6 |
multiplier | 7785 | 25 | 6839 | 28 | 6903 | 25 |
sin | 680530 | 10 | 1820 | 33 | 2379 | 27 |
sqrt | 29593 | 162 | 30945 | 172 | 41626 | 156 |
square | 3732 | 10 | 4189 | 11 | 4275 | 10 |
This experiment shows that mapping using ACD can improve well-optimized LUT networks, resulting in best known results for benchmarks in the EPFL synthesis competition. The previous best results were obtained using a portfolio of heavy logic optimizations applied to various representations, such as AIGs and LUT networks. In recent years, results have been further improved using design-space exploration (DSE) techniques that incrementally generate optimization scripts.
We obtain the optimized AIGs by repeatedly running the script used in the baseline of Table III along with additional delay-oriented AIG commands in ABC. From the obtained AIG, we compare traditional LUT mapping with choices to LUT mapping with ACD. Notably, results from the traditional mapper are quite far from the best results. This observation shows, as expected, that our technology-independent optimization finds worse AIGs than those used to obtain the best results. However, LUT mapping with ACD matches or improves the depth for almost all benchmarks. The improved benchmarks are hyp, log2, multiplier, and sqrt. Remarkably, our method reduces the depth of hyp by 10 levels compared to the state of the art, while reducing area by about 15%. In the benchmark multiplier, our result matches the depth but improves the number of LUTs. Benchmark sin is the only one where there is a large gap compared to the best result. In particular, the best result for sin requires significant logic duplication that is not performed in our synthesis flow. Contrary to many other methods used to produce the best results, our results in Table IV are obtained directly by LUT mapping without employing post-mapping optimization.
VI Conclusion
This work proposes a novel formulation of Ashenhurst-Curtis decomposition (ACD) that enables efficient technology mapping and post-mapping resynthesis. The algorithm is truth-table-based and works for any size of the free set, bound set, and shared set, which makes it well-suited for delay optimization. We have shown that the proposed Boolean decomposition improves on the state of the art in decomposition quality with a competitive run time. We have implemented and integrated the proposed method into a delay-driven LUT mapper. The experiments have shown that LUT mapping with ACD can improve the average delay by %, compared to the traditional structural LUT mapping with choices. Furthermore, the proposed approach has produced best results for test cases in the EPFL synthesis competition.
Acknowledgments
This research was supported by the SNF grant “Supercool: Design methods and tools for superconducting electronics”, 200021_1920981, and Synopsys Inc.
References
- [1] R. L. Ashenhurst, “The decomposition of switching functions,” 1957, pp. 74–116.
- [2] H. A. Curtis, “A new approach to the design of switching circuits,” 1962.
- [3] J. P. Roth and R. M. Karp, “Minimization over boolean graphs,” IBM Journal of Research and Development, vol. 6, no. 2, pp. 227–238, 1962.
- [4] V. N. Kravets and K. A. Sakallah, “Constructive multi-level synthesis by way of functional properties,” Ph.D. dissertation, 2001.
- [5] C. Legl, B. Wurth, and K. Eckl, “Computing support-minimal subfunctions during functional decomposition,” Trans. VLSI, vol. 6, no. 3, pp. 354–363, 1998.
- [6] M. Perkowski, M. Marek-Sadowska, L. Jozwiak, T. Luba, S. Grygiel, M. Nowicka, R. Malvi, Z. Wang, and J. Zhang, “Decomposition of multiple-valued relations,” in Proc. Inter. Symp. on Mult.-Valued Logic, 1997, pp. 13–18.
- [7] J.-H. Jiang, Y. Jiang, and R. K. Brayton, “An implicit method for multi-valued network encoding,” in Proc. IWLS, 2001, pp. 127–131.
- [8] R. Bryant, “Graph-based algorithms for boolean function manipulation,” IEEE Trans. on Computers, vol. C-35, no. 8, pp. 677–691, 1986.
- [9] A. Mishchenko, R. Brayton, and S. Chatterjee, “Boolean factoring and decomposition of logic networks,” in Proc. ICCAD, 2008, pp. 38–44.
- [10] S. Ray, A. Mishchenko, N. Een, R. Brayton, S. Jang, and C. Chen, “Mapping into LUT structures,” in Proc. DATE, 2012.
- [11] J. Cong and Y. Ding, “FlowMap: an optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs,” Trans. CAD, vol. 13, no. 1, pp. 1–12, 1994.
- [12] A. H. Farrahi and M. Sarrafzadeh, “Complexity of the lookup-table minimization problem for FPGA technology mapping,” IEEE Trans. CAD, 1994.
- [13] E. Lehman, Y. Watanabe, J. Grodstein, and H. Harkness, “Logic decomposition during technology mapping,” Trans. CAD, 1997.
- [14] G. Chen and J. Cong, “Simultaneous logic decomposition with technology mapping in FPGA designs,” in Proc. FPGA, 2001, pp. 48–55.
- [15] S. Chatterjee, A. Mishchenko, R. Brayton, X. Wang, and T. Kam, “Reducing structural bias in technology mapping,” in Proc. ICCAD, 2005.
- [16] A. Mishchenko, R. Brayton, A. Tempia Calvino, and G. De Micheli, “Boolean decomposition revisited,” in Proc. IWLS, 2023.
- [17] A. Mishchenko and R. Brayton, “Scalable logic synthesis using a simple circuit structure,” in Proc. IWLS, 2006.
- [18] J. Cong, C. Wu, and Y. Ding, “Cut ranking and pruning: Enabling a general and efficient FPGA mapping solution,” in Proc. FPGA, 1999.
- [19] A. Mishchenko, S. Cho, S. Chatterjee, and R. Brayton, “Combinational and sequential mapping with priority cuts,” in Proc. ICCAD, 2007.
- [20] G. De Micheli, R. Brayton, and A. Sangiovanni-Vincentelli, “Optimal state assignment for finite state machines,” Trans. CAD, vol. 4, no. 3, pp. 269–285, 1985.
- [21] T. Villa and A. Sangiovanni-Vincentelli, “NOVA: state assignment of finite state machines for optimal two-level logic implementation,” Trans. CAD, vol. 9, no. 9, pp. 905–924, 1990.
- [22] S. Yang and M. Ciesielski, “Optimum and suboptimum algorithms for input encoding and its relationship to logic minimization,” Trans. CAD, vol. 10, no. 1, pp. 4–12, 1991.
- [23] R. Brayton and A. Mishchenko, “ABC: An academic industrial-strength verification tool,” in Computer Aided Verification, T. Touili, B. Cook, and P. Jackson, Eds., 2010. [Online]. Available: https://github.com/berkeley-abc/abc
- [24] L. Amarù, P.-E. Gaillardon, and G. De Micheli, “The EPFL combinational benchmark suite,” in Proc. IWLS, 2015.
- [25] A. Mishchenko, S. Chatterjee, and R. Brayton, “FRAIGs: A unifying representation for logic synthesis and verification,” EECS Dep., UC Berkeley, Tech. Rep., 2005.
- [26] W. Yang, L. Wang, and A. Mishchenko, “Lazy man’s logic synthesis,” in Proc. ICCAD, 2012, pp. 597–604.
- [27] Z. Huang, L. Wang, Y. Nasikovskiy, and A. Mishchenko, “Fast boolean matching based on NPN classification,” in Intern. Conf. on Field-Programmable Technology, 2013.
- [28] M. Soeken, A. Mishchenko, A. Petkovska, B. Sterin, P. Ienne, R. K. Brayton, and G. De Micheli, “Heuristic NPN classification for large functions using AIGs and LEXSAT,” in Theory and Applications of Satisfiability Testing, N. Creignou and D. Le Berre, Eds., 2016.
- [29] L. Fan and C. Wu, “FPGA technology mapping with adaptive gate decomposition,” in Proc. FPGA, 2023, pp. 135–140.
- [30] “EPFL synthesis competition best results [2023].” [Online]. Available: https://github.com/lsils/benchmarks/tree/v2023.1/best_results
- [31] A. Mishchenko, R. Brayton, J.-H. R. Jiang, and S. Jang, “Scalable don’t-care-based logic optimization and resynthesis,” ACM Trans. Reconfigurable Technol. Syst., vol. 4, no. 4, 2011.
- [32] B. Schmitt, A. Mishchenko, and R. Brayton, “SAT-based area recovery in structural technology mapping,” in Proc. ASP-DAC, 2018, pp. 586–591.