
ASCENT: Amplifying Power Side-Channel Resilience via Learning & Monte-Carlo Tree Search

Jitendra Bhandari¹^*, Animesh Basak Chowdhury¹^*, Ozgur Sinanoglu², Siddharth Garg¹,
Ramesh Karri¹, Johann Knechtel² ¹New York University, USA, ²New York University Abu Dhabi, UAE

(2024)

Abstract.

Power side-channel (PSC) analysis is pivotal for securing cryptographic hardware. Prior art focused on securing gate-level netlists obtained as-is from chip design automation, neglecting all the complexities and potential side-effects for security arising from the design automation process. That is, automation traditionally prioritizes power, performance, and area (PPA), sidelining security. We propose a “security-first” approach, refining the logic synthesis stage to enhance the overall resilience of PSC countermeasures. We introduce ASCENT, a learning-and-search-based framework that (i) drastically reduces the time for post-design PSC evaluation and (ii) explores the security-vs-PPA design space. Thus, ASCENT enables an efficient exploration of a large number of candidate netlists, leading to an improvement in PSC resilience compared to regular PPA-optimized netlists. ASCENT is up to 120x faster than traditional PSC analysis and yields a 3.11x improvement for PSC resilience of state-of-the-art PSC countermeasures.

Hardware Security, Power Side-Channel, Logic Synthesis, Design-Space Exploration, Monte Carlo Tree Search.

^* J. Bhandari and A. B. Chowdhury contributed equally to this work.

1. Introduction

Hardware implementations of cryptographic and other sensitive algorithms are well-known to be vulnerable to side-channel attacks. Kocher et al. (kocher) first demonstrated the power side-channel attack (PSCA) by exploiting variations in power profiles to extract secret keys; various advanced PSCA versions followed over the years (survey-PSC). To counter such attacks, there are many ongoing efforts to enhance the PSCA resilience of hardware implementations. For example, the state-of-the-art (SOTA) countermeasures (mask-2005; Moos_Moradi_2021; 7324539) augment secret data with random noise to obscure the dependence of power profiles on the secret data. However, all these countermeasures incur large power, performance, and area (PPA) overheads.

Figure 1. PSCA resilience post-integration of the QuadSeal countermeasure (7324539) vs. area-delay of various AES netlists.

To tackle PPA overheads, security researchers typically take the outputs of chip design automation processes (which are optimized for PPA by default) as a starting point, perform PSC analysis, and then propose/apply countermeasures to mitigate PSCAs. However, such an approach can easily overlook circuit configurations that might be inherently more effective in supporting the resilience of PSCA countermeasures, even if they have some PPA disadvantages. We demonstrate this in Figure 1, where we show the area-delay plot of various synthesized netlists versus the final PSCA resilience, for a representative AES hardware implementation with the QuadSeal countermeasure (7324539) applied (see also Sec. 2.3 for the latter).

Our findings in Figure 1 clearly demonstrate that optimizing the baseline netlist for PPA alone does not guarantee the strongest possible defense against PSCAs in the end. This highlights the need to explore alternative circuit configurations for inherently better PSCA resilience, leading to our core research questions:

(1)

What is the right circuit implementation to start with, such that—post-countermeasure application—the hardware would have the best PSCA resilience?
(2)

How can we guide chip design automation to generate such highly PSCA-resilient circuits, yet with low PPA overheads?

In this work, we systematically tackle the fact that different logic synthesis approaches can significantly impact PSCA resilience (for better or worse). This is because different optimization steps result in different netlists with varying characteristics for the type, number, driver strengths, etc. of the standard cells used. Naturally, all these directly impact the power profiles, thereby impacting the prospects of PSCAs as well. However, the related design-space exploration is a tedious, parameter-rich challenge. Furthermore, with commercial synthesis tools, we face a black-box optimization problem, where the relationship between synthesis choices and PSC resilience is difficult to model.

To address these challenges, we propose ASCENT, a framework for amplifying PSC resilience via learning and Monte-Carlo tree search (MCTS) (Figure 2). Our key contributions are:

•

ASCENT provides a hybrid learning-and-search approach, enabling efficient and effective exploration of the security-centric design space. ASCENT enables us to explore 120 $\times$ more configurations compared to a naive search.
•

ASCENT helps us to achieve up to 3.11 $\times$ improvements in PSC resilience for SOTA countermeasure integration when compared to regular, PPA-optimized baseline netlists. At the same time, the PPA impact is well controlled and limited by ASCENT, namely only up to 6.61% more area.
•

We open-source ASCENT as a commitment towards reproducible research. All the algorithms, experiments, and benchmarks are publicly available at https://github.com/NYU-MLDA/scarl.git.

2. Background and Motivation

2.1. Power Side-Channel Attacks (PSCAs)

Cryptographic hardware is vulnerable to PSCAs, which exploit the fluctuations in a device’s power consumption to reveal sensitive information like keys (survey-PSC). Thus, these attacks leverage the fundamental connection between a device’s power consumption and its internal state during operations. There are different types of PSCAs, including simple power analysis (SPA), differential power analysis (DPA), and correlation power analysis (CPA) (brier2004CPA). SPA directly interprets the power profiles during specific cryptographic operations, aiming to deduce sensitive data. DPA goes further by comparing power consumption across multiple similar operations with varying inputs. This technique seeks to isolate variations in power profiles that directly correlate with the influence of the secret key. CPA employs statistical tools, most commonly the Pearson correlation coefficient (PCC), to match hypothesized power consumption patterns against the actual power measurements. The highest correlation often reveals the correct key. Note that we utilize CPA in this work; more details are provided further below.

Technology Implications. With continuous advancements for the ever-shrinking technology nodes, the threat of static power side-channel attacks (S-PSCAs) has significantly increased over the years (leakage-2010-TCAS; amstatic2014; giorgetti2007). Unlike dynamic PSCAs, which focus on power fluctuations during active computations, S-PSCAs exploit the relationship between stored data and static power consumption (leakage power). More specifically, advanced nodes utilize standard cells of various types with different power-performance characteristics, e.g., low-threshold voltage (LVT) and ultra-low threshold voltage (ULVT) cells are faster than the regular (RVT) cells, thereby helping to meet faster timing constraints, albeit at the expense of significantly higher leakage power. LVT and ULVT cells are essential for timing closure, i.e., the careful final-stage efforts in design automation. In short, these implications highlight the need for dedicated countermeasures that protect against S-PSCAs.

Countermeasures. To combat S-PSCAs, security-aware designers can employ masking (mask-2005; bhandari2024lightweight), shuffling, and/or balancing (Moos_Moradi_2021; 7324539) schemes. (Further countermeasures and details are discussed in Section 2.2; the specific countermeasures employed for this work are discussed in Section 2.3.) In general, these strategies seek to obscure the power profiles. Despite their demonstrated effectiveness, they all increase design overheads considerably, necessitating a careful balance between security and PPA during the design process.

As indicated, we tackle this challenge here through our novel, learning-and-search-based framework for logic synthesis.

Simulation-Based Power Analysis. This is crucial for understanding a design’s vulnerability to PSCAs before investing into actual tape-outs. Commonly utilized procedures work along the following lines; we employ such a procedure as well in this work.

First, through gate-level simulations, a value change dump (VCD) file is obtained. This captures all the relevant state information of the device under test. In combination with the post-synthesis netlist and the library files, this allows for accurate power analysis. Second, power simulation tools calculate the power consumption of each cell, including static/leakage power and internal power from input/output switching transitions. Importantly, for S-PSCA assessments, zero-delay simulations enable precise static power capture during specific operations (unlike an averaged leakage power analysis provided by full-timing simulations).

Correlation Power Analysis (CPA). This attack hinges on identifying a correlation between a device’s power consumption and the intermediate data processed throughout cryptographic operations. An attacker collects power consumption data and then hypothesizes on all possible intermediate data values, which often involve direct correlations to parts or derivatives of the secret key.

More specifically, power consumption is predicted for each hypothetical intermediate value, typically using models like Hamming weight or Hamming distance, which relate binary data representation to power. The core of CPA involves calculating the PCC between actual power measurements and the predictions for each hypothesis. The hypothesis yielding the highest correlation often represents the correct assignment for some part/derivative of the key, allowing the attacker to reconstruct the full key eventually.
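The CPA procedure described above can be sketched in a few lines. This is a minimal illustration only, not the attack implementation used in this work: it assumes a toy 4-bit S-box in place of the AES S-Box, synthetic single-sample traces generated from a Hamming-weight model with Gaussian noise, and hypothetical helper names (`simulate_traces`, `cpa_recover_key`).

```python
import math
import random

# Toy 4-bit S-box (a permutation of 0..15); an assumption standing in for the AES S-Box.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def hamming_weight(x):
    """Number of set bits in x (the assumed power model)."""
    return bin(x).count("1")


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def simulate_traces(key, n_traces, noise_sigma, rng):
    """Synthetic traces: Hamming weight of the S-box output plus Gaussian noise."""
    plaintexts = [rng.randrange(16) for _ in range(n_traces)]
    power = [hamming_weight(SBOX[p ^ key]) + rng.gauss(0.0, noise_sigma)
             for p in plaintexts]
    return plaintexts, power


def cpa_recover_key(plaintexts, power):
    """Return the key guess whose predicted leakage correlates best with the traces."""
    scores = {}
    for guess in range(16):
        predicted = [hamming_weight(SBOX[p ^ guess]) for p in plaintexts]
        scores[guess] = pearson(predicted, power)
    return max(scores, key=scores.get)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pts, pwr = simulate_traces(key=0xB, n_traces=2000, noise_sigma=0.5, rng=rng)
recovered = cpa_recover_key(pts, pwr)
```

In this toy setting, the number of traces needed before `recovered` matches the true key grows with `noise_sigma`, which mirrors how the number of required traces serves as a resilience metric.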

2.2. Related Work

S-PSC Attacks and Countermeasures. (giorgetti2007) showed, for the first time, the potential of S-PSCAs as a severe threat. (amstatic2014) have conducted one of the first practical experiments for S-PSCAs using FPGAs, with some follow-up work presented in (7927198). (leakage-2010-TCAS) highlighted the importance of leakage power and its effect on PSCAs especially for advanced nodes. (Moos_Moradi_2021) proposed various countermeasures against S-PSCAs, albeit with considerable PPA overheads. (bhandari2023lightweight) have shown the impact of various types of standard cells on S-PSCAs. (bhandari2024lightweight) proposed a lightweight masking scheme against S-PSCAs. (Karimi_Moos_Moradi_2019) have demonstrated the important side-effect of aging for S-PSCAs in advanced technology nodes. (10.1007/978-3-319-57339-7_5) studied multivariate techniques focused on leakage power consumption to enhance cryptographic security assessments. (9040870; cryptography5030016) proposed standard-cell, delay-based dual-rail pre-charge logic (SC-DDPL) as a countermeasure. However, due to its structural complexity, this countermeasure is incompatible with commercial design flows.

Design Frameworks for Advancing PSC Resilience. (karna) introduced a framework that scores and optimizes design parts to minimize PSC vulnerabilities. This approach is limited by the impractical assumption of timing slacks being ubiquitously available. It also lacks an actual PSC evaluation. (rtl_psc) proposed a framework for assessing PSC vulnerabilities at the register-transfer level (RTL), with the goal to aid countermeasure implementation. (7364404) studied circuit replication and SRAM sharing for PSC resilience in FPGAs. This approach notably increases design costs but significantly improves security while maintaining FPGA configurability. (10.1145/3488932.3517415) emphasized the importance of automated modeling for early, system-level detection of potential leaks. (Tiri2004SecureLS) proposed a methodology for so-called wave dynamic differential logic (WDDL) on FPGAs.

Summary. Prior art for S-PSC countermeasures is focused on detailed empirical studies, with only limited considerations for generalized design-time integration of the countermeasures. At the same time, prior art for frameworks is limited to D-PSCAs, not S-PSCAs, and was proposed for high-level design stages and/or for FPGAs. Thus, there is no prior art that proposes a security-first approach toward the complex, yet critical, challenge of design-space exploration for S-PSCA countermeasure integration in ASICs. Also recall the exploratory finding from Section 1. This gap provides the main motivation for our work.

2.3. Representative Countermeasures

As motivated in Sections 2.1 and 2.2, S-PSCAs are becoming ever-more relevant for advanced technology nodes, and various countermeasures have been proposed. In our study, we consider the following two representative, SOTA countermeasures against S-PSCAs. (While these two SOTA countermeasures appear similar from a high-level view, their implementation details and, thus, their efficiency in hindering S-PSCAs still differ (Moos_Moradi_2021, Table 5).) Importantly, ASCENT is agnostic to the countermeasures a designer likes to explore and eventually integrate.

Quadruple Algorithmic Symmetrizing (QuadSeal) (7324539). This technique can protect against both dynamic and static PSCAs. It focuses on achieving a balance in Hamming weights/distances for the cryptographic operations in hardware. It operates by quadrupling the unprotected circuit structure and balancing the arrangement of the so-called substitution boxes (S-Boxes) in three of those circuit copies in a specific manner. Additionally, it involves rotating inputs to the resulting balanced structure, to mitigate other real-world dependencies introduced by, e.g., manufacturing process variations, timing-path imbalances, aging, etc.

Exhaustive Logic Balancing (ELB) (Moos_Moradi_2021). To reduce the correlation between input data and the leakage current of standard cells, selected sensitive cells are duplicated and fed with inverted input data, which is akin to differential logic. Cell duplication is scaled based on the number of possible input vectors—a single-input cell is duplicated once, while two-input cells are quadrupled. This, along with the inverted inputs, ensures a constantly uniform input distribution across all related cells.

2.4. Logic Synthesis

Logic synthesis transforms a high-level hardware design (e.g., in RTL) into an optimized and technology-specific gate-level netlist. This process offers significant flexibility, as any single design can be mapped to many functionally equivalent but structurally different netlists, all with distinct PPA characteristics. Toward that end, so-called recipes are devised, which are sequences of optimization steps. However, the sheer number of possible recipes and netlists makes this a complex problem, in fact a $\Sigma^{P}_{2}$ -complete problem.

Setting. Commercial tools like Synopsys DC and Cadence Genus are tuned for PPA optimization and use proprietary optimization algorithms toward that end. Aside from scripting interfaces tailored for such PPA optimization, these tools lack direct mechanisms to tune synthesis for other objectives like PSC resilience.

Thus, research into novel optimization techniques, including our work on security-first synthesis, often relies on the open-source and customizable Yosys framework (yosys). In fact, Yosys is the most widely adopted, SOTA synthesis framework for and by academia.

AIG Representation, Generic Problem Formulation. Within Yosys, ABC (abc) is used for combinational optimization. First, ABC converts the design into a homogeneous logic-network implementation called and-inverter graph (AIG). Next, ABC’s algorithms employ various transformations at the sub-graph-level (abc). Importantly, users are free to tweak these algorithms in general and the selection and order of transformations in particular, all to optimize the AIG circuit representation according to their objectives.

More formally, in line with earlier works (chowdhury2021openabc; chowdhury2022bulls; chowdhury2024retrieval), a synthesis recipe $a^{T}$ is a sequence of $T$ transformation steps operating on an AIG structure to optimize for PPA and/or other metrics (e.g., security (basak2023almost)), all while preserving the original functionality. We denote by $\mathcal{A}$ the set of $L$ unique synthesis transformations, $\{a_{0},a_{1},\ldots,a_{L-1}\}$, from which each step of a recipe $a^{T}$ is drawn. Thus, the number of synthesis recipes of length $T$ is $L^{T}$, as transformations may be repeated. This search space is denoted by $\mathcal{A}^{T}$. The problem of generating a PPA-optimal synthesis recipe for an AIG is:

\begin{equation}
\operatorname*{arg\,max}_{a^{T}\in\mathcal{A}^{T}} PPA(AIG_{T}), \quad \text{s.t.}\ AIG_{t+1}=\eta(AIG_{t},a_{t})\ \forall t\in[0,T-1] \tag{1}
\end{equation}

where $\eta$ is the synthesis function defined as $\eta:AIG\times\mathcal{A}\longrightarrow AIG$ .

2.5. Monte-Carlo Tree Search

MCTS is an optimization algorithm best suited for tree-structured search-space exploration. It has been used in selected prior art for logic synthesis (yu2020flowtune; pei2023alphasyn; chowdhury2024retrieval; delorenzo2024make), albeit only for PPA optimization.

Structure. The MCTS search tree contains a root node representing the initial state ($S_{0}$). A node is called a leaf if there exists an action $a_{t}\in\mathcal{A}$ that remains unexplored or if the node is a terminal state. Each node preserves two attributes: (1) the node visit count $N(S_{t},a_{t})$, i.e., the number of times the node is visited during exploration, and (2) the cumulative reward $R(S_{t},a_{t})$, i.e., the total reward obtained while exploring the sub-tree rooted at that node.

MCTS operates in four stages as follows.

1) Selection: of the “most promising node” in the MCTS tree until a leaf node is reached for further exploration. The selection is based on the upper confidence tree (UCT) computation as follows:

\begin{equation}
\pi_{MCTS}(S_{t})=\operatorname*{arg\,max}_{a_{t}\in\mathcal{A}}\left(\underbrace{\frac{R(S_{t},a_{t})}{N(S_{t},a_{t})}}_{\text{Exploitation}}+c\cdot\underbrace{\sqrt{\frac{\log\sum_{a_{t}\in\mathcal{A}}N(S_{t},a_{t})}{N(S_{t},a_{t})}}}_{\text{Exploration}}\right) \tag{2}
\end{equation}

That is, UCT computation considers exploitation and exploration, weighted against each other by the constant $c$.

Exploitation computes the ratio of the reward $R(S_{t},a_{t})$ accumulated over the sub-tree rooted at that node and the visits count $N(S_{t},a_{t})$ . This average reward is obtained by exploring the sub-tree which, in turn, is done by picking an action $a_{t}$ from state $S_{t}$ .

Exploration prioritizes nodes which have been less explored so far. It loosely computes the ratio of the parent’s visit count ( $\sum_{a_{t}\in\mathcal{A}}N(S_{t},a_{t})$ ) and the current node’s visit count ( $N(S_{t},a_{t})$ ). Thus, the score increases if the parent node has been frequently visited whereas the current node has been less explored.

Starting from state $S_{0}$ , $\pi_{MCTS}$ navigates the search space by selecting the “best” action $a_{t}$ that achieves the maximum score combining exploitation and exploration terms (Equation 2), effectively performing a best-first exploration. The selection process continues until a leaf node is encountered.

2 and 3) Expansion and Rollout: Once a leaf node is selected, an action $a_{t}$ is chosen (at random) from the set of unexplored actions. A node is added to the MCTS tree and its attributes are initialized. We call the sequence of nodes visited during MCTS selection and expansion the trajectory $\tau$, which is represented by $\{S_{0},(S_{0},a_{0}),(S_{1},a_{1}),\ldots,(S_{\tau},a_{\tau})\}$.

4) Backpropagation: After computing the score for the terminal state $S_{T}$, the trajectory $\tau$ is backtracked from the leaf node to the root node. For each node in $\tau$, the cumulative reward $R(S_{t},a_{t})$ and the node visit count $N(S_{t},a_{t})$ are incremented by $R(S_{T})$ and $1$, respectively. Next, the process repeats from Stage 1).
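The four stages above can be sketched as a compact, generic MCTS loop. This is an illustrative sketch under stated assumptions, not the ASCENT implementation: states are partial recipes encoded as tuples of action indices, the rollout completes a recipe with uniformly random actions, and `toy_reward` is a made-up terminal reward that merely stands in for a score of the final netlist.

```python
import math
import random


class Node:
    """One tree node; `state` is the partial recipe (tuple of action indices)."""

    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}       # action index -> child Node
        self.visits = 0          # N: visit count
        self.total_reward = 0.0  # R: cumulative reward


def uct(child, parent_visits, c):
    """Average reward (exploitation) plus c times the exploration bonus."""
    exploit = child.total_reward / child.visits
    explore = math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + c * explore


def mcts(num_actions, depth, reward_fn, iterations, c=1.0, seed=0):
    rng = random.Random(seed)
    root = Node(state=())
    best_recipe, best_reward = None, float("-inf")
    for _ in range(iterations):
        node = root
        # 1) Selection: descend while the node is fully expanded and non-terminal.
        while len(node.state) < depth and len(node.children) == num_actions:
            children = list(node.children.values())
            parent_visits = sum(ch.visits for ch in children)
            node = max(children, key=lambda ch: uct(ch, parent_visits, c))
        # 2) Expansion: add one unexplored action, chosen at random.
        if len(node.state) < depth:
            untried = [a for a in range(num_actions) if a not in node.children]
            action = rng.choice(untried)
            child = Node(node.state + (action,), parent=node)
            node.children[action] = child
            node = child
        # 3) Rollout: complete the recipe randomly until the terminal state.
        recipe = list(node.state)
        while len(recipe) < depth:
            recipe.append(rng.randrange(num_actions))
        recipe = tuple(recipe)
        reward = reward_fn(recipe)  # delayed reward, only at the terminal state
        if reward > best_reward:
            best_recipe, best_reward = recipe, reward
        # 4) Backpropagation: update N and R from the leaf up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return root, best_recipe, best_reward


def toy_reward(recipe):
    """Made-up terminal reward in [0, 1]; purely illustrative."""
    return sum(1 for i, a in enumerate(recipe) if a == i % 3) / len(recipe)


root, best_recipe, best_reward = mcts(
    num_actions=3, depth=4, reward_fn=toy_reward, iterations=300)
```

Selection only computes the UCT score once every action of a node has been expanded, so each child has a non-zero visit count and the logarithm is well defined.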

3. Problem Formulation

We believe that logic synthesis offers ample opportunities to discover inherently more S-PSCA-resilient netlist structures that can amplify the resilience of SOTA countermeasures even further. However, while traditional PPA-focused synthesis does impact S-PSCA outcomes, recall that a dedicated security-first approach is fundamentally missing (Section 2.2). We understand and emphasize that realizing such an approach requires enormous efforts in practice. This is due to two key facts:

(1)

synthesis in general is already complex and its search-space computationally expensive to explore (see Section 2.4);
and, coming on top,
(2)

actual S-PSC evaluation, which is essential for accurate guidance for a security-first synthesis method, incurs significant further computation cost (see further below).

Figure 3 illustrates this problem for an exemplary, PSC-aware synthesis framework. Next, we formalize this problem in general. Subsequently, we evidence the practical challenges outlined above in more detail. We also indicate the techniques we utilize to address these. Finally, we provide the specific problem formulation, leading to the proposed ASCENT framework.

Formulation. Our goal is to find a netlist that maximizes S-PSCA resilience after application of a SOTA S-PSCA countermeasure. To address these challenges, we formulate security-first synthesis as an optimization problem, guided by a Markov decision process (MDP), with distinct states, actions, transitions, and rewards.

•

State $S_{t}$ at step $t$ is the AIG of the design $D$ after applying a partial synthesis recipe of length $t$ . $AIG_{0}$ is the initial AIG extracted from $D$ . The terminal state $AIG_{T}$ is the AIG generated after applying a synthesis recipe of maximum length $T$ .
•

Actions $\mathcal{A}$ is the set of $L$ functionality-preserving transformations $\{a_{0},a_{1},\ldots,a_{L-1}\}$ ( $a_{i}\in\mathcal{A}$ ) provided by a synthesis tool.
•

State transition $\eta(S_{t+1}|S_{t},a_{t})$ is the transformation that results from applying action $a_{t}$ to state $S_{t}$, yielding state $S_{t+1}$. Here, the transition function is deterministic, i.e., it always yields the same AIG for a given state and action.
•

Reward expresses the S-PSCA resilience as so-called $PT_{score}$ , i.e., the number of power traces required for successful key extraction, of the post-countermeasure netlist. We consider a delayed reward model and assign zero reward to every action until we reach terminal state $S_{T}$ .

We define the overall problem as $PT_{score}$ maximization:

\begin{equation}
\operatorname*{arg\,max}_{a^{T}\in\mathcal{A}^{T}} PT_{score}(\mathcal{C}(S_{T})), \quad \text{s.t.}\ S_{t+1}=\eta(S_{t},a_{t})\ \forall t\in[0,T-1] \tag{3}
\end{equation}

where $\mathcal{C}$ is the countermeasure applied on the synthesized netlist.

Practical Challenges. As indicated, there are critical barriers in terms of computational complexity associated with this problem. For example, the search space for synthesis in general is of size $L^{T}$, which for $L=13$ and $T=18$ is approximately $10^{20}$. (This exemplary choice of $L=13$ and $T=18$ is in line with the number of unique synthesis transformations and the length of synthesis recipes, respectively, available for Yosys’ compress2rs recipe.)

As our problem has a non-analytical form and no closed-form solution is available, we must rely on gradient-free optimization methods. This mandates means for inexpensive reward evaluation. However, our experimentation shows that running an accurate PSC attack evaluation on post-countermeasure netlists requires up to 100k test vectors, which takes $\approx 6$ hours of simulation runtime. Thus, even when evaluating only 100 samples using any gradient-free optimizer, the process would take $\approx 25$ days.
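The two figures quoted above follow from simple arithmetic, which the short sketch below reproduces. The function names are hypothetical; the inputs ($L=13$, $T=18$, 100 samples, about 6 hours per evaluation) are taken from the text, and sequential evaluation is assumed.

```python
def num_recipes(num_transformations, recipe_length):
    """Size of the recipe search space, L**T (transformations may repeat)."""
    return num_transformations ** recipe_length


def total_runtime_days(num_samples, hours_per_sample):
    """Wall-clock days when samples are evaluated one after another."""
    return num_samples * hours_per_sample / 24.0


space = num_recipes(13, 18)        # roughly 1.1e20 candidate recipes
days = total_runtime_days(100, 6)  # 600 hours in total
print(f"{space:.2e} recipes; {days:.0f} days for 100 sequential evaluations")
```

Even a tiny fraction of this space is out of reach at hours per evaluation, which is why an inexpensive surrogate for the reward is needed.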

These important observations raise the following questions toward computationally-efficient S-PSC-aware logic synthesis:

•

Given a runtime budget, how can we quickly, yet accurately, evaluate S-PSC attacks for some post-countermeasure netlist?
•

How can we efficiently explore the search space of synthesis recipes to obtain S-PSC-resilient netlists?

Outline of Method. Addressing these challenges necessitates an optimization approach that balances search efficiency with accurate S-PSCA assessment. This leads us to the design of ASCENT (Section 4), which employs a hybrid learning-and-search strategy. More specifically, ASCENT utilizes (i) a zero-shot predictor $\hat{PT}_{score}(\mathcal{C}(S_{T}),\theta)$ to significantly speed up the PSC evaluation, without loss of accuracy, and (ii) MCTS (Section 2.5) to explore the large and complex search space for security-first synthesis in an effective and efficient manner.

Extended Formulation. Assume a predictor $\hat{PT}_{score}(\mathcal{C}(S_{T}),\theta)$, which predicts the number of power traces required for PSCAs for a design $D$ synthesized using a $T$-length recipe. Then, we seek to solve the following problem:

\begin{equation}
\operatorname*{arg\,max}_{a^{T}\in\mathcal{A}^{T}} \hat{PT}_{score}(\mathcal{C}(S_{T}),\theta), \quad \text{s.t.}\ S_{t+1}=\eta(S_{t},a_{t})\ \forall t\in[0,T-1] \tag{4}
\end{equation}

where $\mathbf{\theta}$ represents the predictor’s parameters. In simple terms, the predictor shall serve as a computationally efficient surrogate for direct S-PSCA evaluation.

4. ASCENT Framework

We solve the problem formulation in Equation 4 in three steps, as illustrated in Figure 4. Next, we outline these three steps.

First, we provide an exploratory experiment which clearly demonstrates that maximizing the S-PSCA resilience of a netlist post-countermeasure integration, i.e., maximizing $PT_{score}(\mathcal{C}(S_{T}))$, is equivalent to maximizing the pre-countermeasure $PT_{score}(S_{T})$. Therefore, we train a zero-shot $\hat{PT}_{score}$ predictor using $PT_{score}$ values obtained from diverse synthesized netlists. We show that it suffices to start with some representative data points to train such a zero-shot predictor.

Second, we employ the zero-shot predictor as S-PSCA evaluator for the MCTS-based exploration of the search space. This predictor within MCTS is essential to drive the sequential decision-making process. Note that we also improve the predictor on the fly: we simulate the actual power traces for the top- $k$ and bottom- $k$ recipes during the MCTS process and accordingly fine-tune the predictor.

Third, we put all parts together and conduct an end-to-end optimization process, including a final validation of the PSC resilience by actual PSC evaluation after the optimization process is done.

4.1. Zero-Shot Predictor (➊)

To obtain a zero-shot predictor, we have to collect historical S-PSCA data in a one-time effort. Key challenges here are (i) high runtime cost of S-PSCA evaluation even for such one-time effort and (ii) ensuring a representative dataset of diverse netlists with varying S-PSCA resilience. Next, we discuss how we tackle these challenges and finally outline the actual training approach.

4.1.1. Pre- vs Post-Countermeasure Evaluation

We observe a monotonic relationship between pre- and post-countermeasure $PT_{score}$ values. Figure 5 and Figure 6 show $PT_{score}$ values for 50 randomly synthesized netlists pre- and post-countermeasure application for the two representative countermeasures of choice. (The gaps between 6k and 7k for the baseline/pre-countermeasure netlists are only due to chance, i.e., none of the 50 random recipes provides scores in that range.)

Importantly, this confirms that aiming for pre-countermeasure resilience suffices for a guided design-space exploration toward best post-countermeasure resilience. This observation helps to significantly limit the runtime cost for obtaining training data. In fact, running S-PSCA evaluations on pre-countermeasure netlists provides a 7.2x speed-up: it takes only 40–50 minutes as compared to around 6 hours for post-countermeasure netlists. This is due to the lower resilience of the pre-countermeasure netlists; S-PSCAs can find the correct key with fewer traces and in shorter time.

4.1.2. Dataset Diversity

We are inspired by the sampling approaches described in (bai2023towards), which lead us to focus on exploring netlists that result in diverse scores. Without loss of generality, we utilize simulated annealing (SA) toward this end. Note that we tune the annealing schedule for more diversity during exploration.

4.1.3. Training of the Predictor

Our predictor is a regressor model. It is trained on the pre-countermeasure netlists, using three handcrafted features as inputs and $PT_{score}$ values (obtained by actual S-PSCA evaluation) as labels. The features are: (1) Overall Diversity: count of the various cell types found in the synthesized netlists; (2, 3) Specific Diversity: percentage of area consumed by LVT (2) and HVT (3) cells. We trained and evaluated various predictor versions, exploring a wide range of other hand-crafted features (e.g., area, delay, etc.) as well as more synthesized designs and more varied recipes. However, the best performance, as in the best prediction of post-countermeasure resilience based on pre-countermeasure resilience, was obtained for the three features above.
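As an illustration of the three features, the sketch below computes them from a flat list of cell instances. The `Cell` record and its fields are hypothetical (a real flow would parse them from the synthesized netlist and library reports), and "count of the various cell types" is interpreted here as the number of distinct cell types, which is an assumption.

```python
from collections import namedtuple

# Hypothetical per-instance record: library cell name, VT flavor, and cell area.
Cell = namedtuple("Cell", ["cell_type", "vt", "area"])


def extract_features(cells):
    """Return the three handcrafted features for one netlist."""
    total_area = sum(c.area for c in cells)

    def area_pct(vt):
        # Share of the total cell area taken by cells of the given VT flavor.
        if total_area == 0:
            return 0.0
        return 100.0 * sum(c.area for c in cells if c.vt == vt) / total_area

    return {
        "num_cell_types": len({c.cell_type for c in cells}),  # overall diversity
        "lvt_area_pct": area_pct("LVT"),                      # specific diversity
        "hvt_area_pct": area_pct("HVT"),                      # specific diversity
    }


example = [
    Cell("NAND2_LVT", "LVT", 1.0),
    Cell("INV_HVT", "HVT", 1.0),
    Cell("NAND2_LVT", "LVT", 2.0),
]
features = extract_features(example)
```

The resulting dictionary would form one input row for the regressor, with the measured $PT_{score}$ of the same netlist as its label.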

4.2. Monte-Carlo Tree Search & Fine-Tuning (➋)

Having established a fast and accurate zero-shot predictor, we now have to integrate it into a search strategy for maximizing PSCA resilience. MCTS, with its ability to intelligently balance exploration and exploitation, is a natural fit for this complex task. Its delayed reward model aligns with our problem.

4.2.1. MCTS Implementation

We employ MCTS as outlined in Sections 2.5 and 3. More details are provided next.

We implement the critical reward component as follows. We assign normalized $PT_{score}(S_{T})$ values, obtained from the zero-shot predictor $\hat{PT}_{score}$, to a terminal state $S_{T}$:

\begin{equation}
PT^{norm}_{score}(S_{T})=\begin{dcases}0&\text{if }PT_{score}\leq PT_{threshold}\\ \frac{PT_{score}}{PT_{threshold}}&\text{otherwise}\end{dcases} \tag{5}
\end{equation}

In plain words, we compare the predicted scores against a user-defined threshold ( $PT_{threshold}$ ) and assign $0$ reward if the score does not exceed the threshold. This reward formulation allows MCTS to skip any unpromising paths in the search space. The reward for promising paths is normalized so that the UCT computation maintains the balance between exploitation and exploration.
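A direct transcription of Equation 5 is sketched below. The function name and the guard against non-positive thresholds are additions for illustration; the example threshold of 10,000 traces is an arbitrary value, not one reported in this work.

```python
def normalized_reward(pt_score, pt_threshold):
    """Zero reward at or below the threshold, else the score relative to it."""
    if pt_threshold <= 0:
        raise ValueError("pt_threshold must be positive")
    if pt_score <= pt_threshold:
        return 0.0
    return pt_score / pt_threshold


# Illustrative threshold of 10,000 traces (assumed, not from the experiments).
below = normalized_reward(5_000, 10_000)   # unpromising path: no reward
above = normalized_reward(15_000, 10_000)  # promising path: reward above 1
```

Note that any non-zero reward is greater than 1 by construction, so the exploration constant $c$ in the UCT score has to be chosen relative to that scale.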

For the expansion and rollout stages, note the following. A sequence of actions is taken, which in turn synthesizes the netlist until the terminal state is reached. Then, the $\hat{PT}_{score}$ predictor is used to obtain the $PT_{score}$ value of the netlist. A new node is added and assigned the updated $R(S_{t},a_{t})$ value.

4.2.2. Online Fine-Tuning of Predictor

As indicated, we simulate the actual power traces for the top- $k$ and bottom- $k$ recipes during the MCTS process and accordingly fine-tune the predictor $\hat{PT}_{score}$ . This is done to ensure the predictor is continually updated with relevant corner-case data points, all without hampering the ongoing search-space exploration. For efficiency, we conduct the S-PSCA evaluation for these netlists in parallel.
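Selecting the recipes to re-evaluate can be sketched as follows. This is a minimal sketch assuming the search keeps a dictionary from recipes to predicted scores; the function name and the example recipes and scores are made up, and the actual power simulation and predictor update are outside its scope.

```python
import heapq


def select_for_finetuning(predicted_scores, k):
    """Return the top-k and bottom-k recipes by predicted score, without duplicates."""
    items = list(predicted_scores.items())
    top = heapq.nlargest(k, items, key=lambda kv: kv[1])
    bottom = heapq.nsmallest(k, items, key=lambda kv: kv[1])
    selected, seen = [], set()
    for recipe, _score in top + bottom:
        if recipe not in seen:  # top and bottom overlap if 2k exceeds the pool size
            seen.add(recipe)
            selected.append(recipe)
    return selected


# Made-up recipes (tuples of transformation names) and predicted scores.
pool = {
    ("balance", "rewrite"): 4200.0,
    ("rewrite", "refactor"): 7800.0,
    ("resub", "balance"): 6100.0,
    ("refactor", "resub"): 9300.0,
}
to_simulate = select_for_finetuning(pool, k=1)
```

The selected recipes would then go through actual S-PSCA evaluation, and the resulting (features, score) pairs would be appended to the training data.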

4.2.3. Justification of MCTS

We note that reinforcement learning (RL)-based methods like (hosny2020drills; zhu2020exploring) are gaining significant attention for advancing logic synthesis. The key differences between our work and these works, which make MCTS uniquely suitable, are as follows.

In RL-based methods, the asynchronous actor-critic approach encounters challenges in delayed reward systems. Thus, RL-based methods tuned for PPA optimization typically assign immediate rewards, like reduced depth of AIGs, which also correlate well with the final PPA. However, there is no such direct correlation between AIGs and S-PSCA resilience. Therefore, we cannot utilize immediate rewards and, thus, chose MCTS. In fact, its notion of back-propagation (Section 2.5) helps to accurately estimate average rewards even for intermediate nodes in the search tree.

4.3. Integration and End-to-End Validation (➌)

After completion of the MCTS process, we obtain the best synthesis recipe using a greedy approach following standard procedures (most-rewarding-node selection and most-visited-node selection). This recipe is expected to generate the most S-PSCA resilient design post-countermeasure application. Finally, we synthesize the circuit using this recipe, apply the countermeasure of choice, and run an actual S-PSCA evaluation to report the final $PT_{score}$ .
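A minimal sketch of such a greedy extraction, supporting both selection criteria. The tree layout and all names are hypothetical, and the action names used in any example are placeholders rather than the actual action set.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    visits: int = 0
    total_reward: float = 0.0
    children: dict = field(default_factory=dict)  # action name -> TreeNode

def extract_recipe(root, criterion="visits"):
    """Greedily descend from the root, picking the best child at each level.

    criterion -- "visits" for most-visited-node selection,
                 anything else for most-rewarding (highest mean reward).
    """
    def key(item):
        child = item[1]
        if criterion == "visits":
            return child.visits
        return child.total_reward / child.visits if child.visits else 0.0

    recipe, node = [], root
    while node.children:
        action, node = max(node.children.items(), key=key)
        recipe.append(action)
    return recipe
```

The two criteria can disagree: a rarely visited child may have a higher mean reward than the most-visited one, so which one is used affects the final recipe.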

5. Experiments

5.1. Setup

AES Implementation. We use non-masked AES in Electronic Code Book (ECB) mode, with the S-Box implemented as a look-up table. ECB mode pads the data to the block length, which is 128 bits.

We use a commercial 28nm technology library, focusing on the TT corner (25 degrees Celsius, 0.9V), and using different VT cells.

S-PSCA Setup. We implement a C++-based CPA attack following (knechtel22_PSC). It incrementally increases the number of traces through coarse and thorough sampling for 128 trials, aiming for a 90% success rate with thorough sampling. In other words, the attack is thoroughly assessed in multiple runs, not only one-shot trials.

The total number of traces depends on the case study; for baseline AES, up to 10K traces are sufficient, whereas QuadSeal and ELB countermeasures on top require 50K and 100K traces, respectively.

$PT_{score}$ Predictor. We collected a corpus of 1,000 random, different synthesis recipes for the baseline AES, with features extracted as in Section 4. This took us $\approx$ 30 days. We split the dataset into 80% and 20% for training and testing, respectively.

We used XGBoost (chen2016xgboost), a scalable and distributed, gradient-boosted decision tree (GBDT) library to implement the predictor model. We trained our model on an AMD EPYC 7542 server equipped with 128 CPUs and 1TB RAM, running Red Hat Enterprise Linux Server Release 7.9 (Maipo). The CAD flow utilized both open-source and commercial tools, including Synopsys VCS M-2017.03-SP1 for RTL and gate-level simulations, Yosys for logic synthesis, and Synopsys PrimeTime PX M-2017.06 for power simulations.

ASCENT Framework. We developed ASCENT in-house by implementing the MCTS algorithm and coupling it with the $PT_{score}$ predictor. We compute the handcrafted features from the synthesized netlist and pass them to the predictor, which provides quick feedback with a latency of 100 ms. We use Yosys (v0.38) for end-to-end synthesis, which takes $\sim$ 2.8 minutes in a single-threaded run on our Intel server (Frequency: 2.3GHz, Memory: 256GB).

5.2. Results

5.2.1. $PT_{score}$ Predictor

Figure 7 shows our predictor’s performance on the test datasets, where the x-axis and y-axis denote “Ground Truth” and “Predicted Scores”, respectively. The Root Mean Squared Error (RMSE) score for the model on the test datasets is 565.58. However, recall that the predictor will be continuously improved by fine-tuning to further enhance its performance.
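For reference, the RMSE is the square root of the mean squared difference between ground-truth and predicted scores, so it is expressed in the same units as $PT_{score}$ itself. A minimal sketch (the function name is illustrative):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between ground-truth and predicted scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / len(y_true))
```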

5.2.2. ASCENT Framework

Table 1. Comparison of $PT_{score}$ across recipes. Shown are only recipes with maximal $PT_{score}$ .

Synthesis recipes	Baseline AES				AES + QuadSeal				AES + ELB
	PT	Area	Power	Delay	PT	Area	Power	Delay	PT	Area	Power	Delay
	score	( $\mu$ m²)	(mW)	(ns)	score	( $\mu$ m²)	(mW)	(ns)	score	( $\mu$ m²)	(mW)	(ns)
compress2rs	2800	12391	0.65	0.28	12500	16430	0.83	0.31	28000	24321	1.31	0.36
resyn2rs	2900	12380	0.62	0.3	13000	16712	0.86	0.33	29000	24627	1.39	0.39
SA	4900	12542	0.73	0.31	17500	17123	0.91	0.34	42000	25242	1.48	0.41
MCTS	5300	13723	0.76	0.31	18500	18204	0.93	0.35	47000	26111	1.54	0.43
ASCENT	9100	13155	0.74	0.3	42500	18097	0.92	0.35	87000	25928	1.53	0.44

We started by running the S-PSCA evaluation for the compress2rs recipe, which we consider as the baseline against which all other methods in this work are compared. The recipe for compress2rs is provided in (github). We consider three different settings for our experiments, namely {AES, AES+QuadSeal, AES+ELB}, where AES denotes the baseline without countermeasures, AES+QuadSeal denotes AES with the QuadSeal countermeasure, and AES+ELB denotes AES with the ELB countermeasure (Section 2.3).

Table 1 details our results and the comparison across the various methods studied in this work. To ease interpretation, we report the $PT_{score}$ and overheads of each method relative to the baseline compress2rs run. For example, for the baseline AES, we compare the results of each method with the corresponding values from compress2rs. For the resyn2rs recipe, we observed {1.04 $\times$ , 1.04 $\times$ , 1.04 $\times$ } improvement in the $PT_{score}$ values, along with an additional {-0.09%, 1.7%, 1.26%} area, {-4.6%, 3.61%, 6.11%} power, and {7.14%, 6.45%, 8.33%} delay overhead across the three scenarios, respectively. Thus, with the introduction of the countermeasures, we can see that the $PT_{score}$ is improved; this is expected.
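The relative figures quoted for Table 1 can be reproduced with a short calculation. The sketch below uses the baseline-AES values of the compress2rs and resyn2rs rows from Table 1; the function and key names are illustrative.

```python
def relative_to_baseline(row, base):
    """Return the PT-score gain factor and percentage overheads versus the baseline."""
    def pct(key):
        return 100.0 * (row[key] - base[key]) / base[key]

    return {
        "pt_gain": row["pt"] / base["pt"],  # improvement factor
        "area_pct": pct("area"),            # overhead in percent
        "power_pct": pct("power"),
        "delay_pct": pct("delay"),
    }

# Baseline-AES columns of Table 1 (area in um^2, power in mW, delay in ns).
compress2rs = {"pt": 2800, "area": 12391, "power": 0.65, "delay": 0.28}
resyn2rs = {"pt": 2900, "area": 12380, "power": 0.62, "delay": 0.30}

result = relative_to_baseline(resyn2rs, compress2rs)
```

Negative percentages indicate that the recipe improves on the baseline for that metric, as is the case for the area and power of resyn2rs on baseline AES.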

We then utilize the same black-box optimizer that was used for our predictor model, SA, to also explore the security-first search space. Allowing a timeout of 3 days, there is an improvement across all three scenarios: {1.75 $\times$ , 1.4 $\times$ , 1.5 $\times$ } higher scores, albeit at an overhead of {1.22%, 4.22%, 3.79%} area, {12.31%, 9.64%, 12.98%} power, and {10.71%, 9.68%, 13.89%} delay, respectively.

We then utilized MCTS without the predictor (row MCTS in Table 1) to obtain the synthesis recipes with maximized $PT_{score}$ values. Here, we obtained an increase in resilience of {1.89 $\times$ , 1.48 $\times$ , 1.68 $\times$ }, along with overheads of {10.75%, 10.78%, 7.36%} area, {16.92%, 12.05%, 17.56%} power, and {10.71%, 12.90%, 19.44%} delay, respectively.

Table 2. Runtime performance achieved by ASCENT and competitive methods. Iterations denote the number of times the full S-PSCA evaluation can be run. We set a common timeout of 72 hours for fair comparison.

Method	Iterations	Speed-up
SA (AES with countermeasure)	12	1.0 $\times$
MCTS (AES with countermeasure)	12	1.0 $\times$
SA (baseline AES)	108	9.0 $\times$
MCTS (baseline AES)	107	8.3 $\times$
ASCENT	1460	121.3 $\times$

Finally, we used our full ASCENT framework to explore the design space more effectively, again with a timeout of 3 days. Recall that, thanks to the predictor model, the time-consuming actual S-PSCA evaluation can be bypassed in the MCTS exploration phase. We pick the top-3 recipes in terms of maximal $PT_{score}$ and run the actual S-PSCA evaluation on them for final validation. From Table 1, it can be seen that, relative to the baseline, we achieved on average across these top-3 recipes {3.25 $\times$ , 3.40 $\times$ , 3.11 $\times$ } higher PSC resilience with an overhead of only {6.16%, 10.15%, 6.61%} area, {13.85%, 10.84%, 16.79%} power, and {7.14%, 12.90%, 22.22%} delay, respectively.

Table 2 reports the runtime comparison of the various methods. The first two cases, { SA (AES with countermeasures), MCTS (AES with countermeasures) }, require running the S-PSCA evaluation end-to-end, with countermeasures in place. These require collecting 50K and 100K traces for QuadSeal and ELB, respectively, compared to only 10K traces for the baseline AES. Thus, 5 $\times$ and 10 $\times$ more time is required, respectively. For the next two cases, namely SA and MCTS on the baseline AES, we obtained speed-ups of 9.0 $\times$ and 8.3 $\times$ , respectively. In short, ASCENT is able to explore a much larger design space much more quickly than the other methods; e.g., it is 120 $\times$ faster than the default black-box optimizer SA.

6. Conclusion

In this work, we proposed ASCENT, a novel synthesis framework. We have successfully enhanced the final resilience of existing power side-channel countermeasures by carefully guided, yet efficient, exploration of the complex search space. In fact, ASCENT enables a $3.11\times$ post-countermeasure improvement when compared to conventional synthesis (tailored for PPA optimization). For future work, we plan to tailor ASCENT to harden circuits also against other threats like fault injection.

\printbibliography