Carry Your Fault: A Fault Propagation Attack on Side-Channel Protected LWE-based KEM

Authors: Suparna Kundu, Siddhartha Chowdhury, Sayandeep Saha, Angshuman Karmakar, Debdeep Mukhopadhyay, Ingrid Verbauwhede

Abstract: Post-quantum cryptographic (PQC) algorithms, especially those based on the learning with errors (LWE) problem, have been subjected to several physical attacks in the recent past. Although the attacks broadly belong to two classes, passive side-channel attacks and active fault attacks, the attack strategies vary significantly due to the inherent complexities of such algorithms. Exploring further attack surfaces is, therefore, an important step for eventually securing the deployment of these algorithms. Also, it is important to test the robustness of the already proposed countermeasures in this regard. In this work, we propose a new fault attack on side-channel secure masked implementations of LWE-based key-encapsulation mechanisms (KEMs) exploiting fault propagation. The attack typically originates due to an algorithmic modification widely used to enable masking, namely the Arithmetic-to-Boolean (A2B) conversion. We exploit the data dependency of the adder carry chain in A2B and extract sensitive information, even when masking (of arbitrary order) is present. As a practical demonstration of the exploitability of this information leakage, we show key recovery attacks on Kyber, although the leakage also exists for other schemes like Saber. The attack on Kyber targets the decapsulation module and utilizes Belief Propagation (BP) for key recovery.
To the best of our knowledge, it is the first attack exploiting an algorithmic component introduced to ease masking rather than only exploiting the randomness introduced by masking to obtain desired faults (as done by Delvaux). Finally, we performed both simulated and electromagnetic (EM) fault-based practical validation of the attack for an open-source first-order secure Kyber implementation running on an STM32 platform.

Submitted 25 January, 2024; originally announced January 2024.

ACM Class: E.3.3

arXiv:2311.08040 [pdf, ps, other]

On the Masking-Friendly Designs for Post-Quantum Cryptography

Authors: Suparna Kundu, Angshuman Karmakar, Ingrid Verbauwhede

Abstract: Masking is a well-known and provably secure countermeasure against side-channel attacks. However, due to additional redundant computations, integrating masking schemes is expensive in terms of performance. The performance overhead of integrating masking countermeasures is heavily influenced by the design choices of a cryptographic algorithm and is often not considered during the design phase. In this work, we deliberate on the effect of design choices on integrating masking techniques into lattice-based cryptography. We select Scabbard, a suite of three lattice-based post-quantum key-encapsulation mechanisms (KEM), namely Florete, Espada, and Sable. We provide arbitrary-order masked implementations of all the constituent KEMs of the Scabbard suite by exploiting their specific design elements. We show that the masked implementations of Florete, Espada, and Sable outperform the masked implementations of Kyber in terms of speed for any order masking. Masked Florete exhibits a $73\%$, $71\%$, and $70\%$ performance improvement over masked Kyber corresponding to the first-, second-, and third-order. Similarly, Espada exhibits $56\%$, $59\%$, and $60\%$ and Sable exhibits $75\%$, $74\%$, and $73\%$ enhanced performance for first-, second-, and third-order masking compared to Kyber respectively. Our results show that the design decisions have a significant impact on the efficiency of integrating masking countermeasures into lattice-based cryptography.

Submitted 14 November, 2023; originally announced November 2023.

ACM Class: E.3.3

arXiv:2311.08027 [pdf, other]

A practical key-recovery attack on LWE-based key-encapsulation mechanism schemes using Rowhammer

Authors: Puja Mondal, Suparna Kundu, Sarani Bhattacharya, Angshuman Karmakar, Ingrid Verbauwhede

Abstract: Physical attacks are serious threats to cryptosystems deployed in the real world. In this work, we propose a microarchitectural end-to-end attack methodology on generic lattice-based post-quantum key encapsulation mechanisms to recover the long-term secret key. Our attack targets a critical component of a Fujisaki-Okamoto transform that is used in the construction of almost all lattice-based key encapsulation mechanisms. We demonstrate our attack model on practical schemes such as Kyber and Saber by using Rowhammer. We show that our attack is highly practical and imposes few preconditions on the attacker to succeed. As an additional contribution, we propose an improved version of the plaintext checking oracle, which is used by almost all physical attack strategies on lattice-based key-encapsulation mechanisms. Our improvement reduces the number of queries to the plaintext checking oracle by as much as $39\%$ for Saber and approximately $23\%$ for Kyber768. This can be of independent interest and can also be used to reduce the complexity of other attacks.

Submitted 14 November, 2023; originally announced November 2023.

ACM Class: E.3.3

arXiv:2305.10368 [pdf, other]

A 334$μ$W 0.158mm$^2$ ASIC for Post-Quantum Key-Encapsulation Mechanism Saber with Low-latency Striding Toom-Cook Multiplication Authors Version

Authors: Archisman Ghosh, Jose Maria Bermudo Mera, Angshuman Karmakar, Debayan Das, Santosh Ghosh, Ingrid Verbauwhede, Shreyas Sen

Abstract: The hard mathematical problems that assure the security of our current public-key cryptography (RSA, ECC) are broken if and when a quantum computer appears, rendering them ineffective for use in the quantum era. Lattice based cryptography is a novel approach to public key cryptography, of which the mathematical investigation (so far) resists attacks from quantum computers. By choosing a module learning with errors (MLWE) algorithm as the next standard, the National Institute of Standards and Technology (NIST) follows this approach. The multiplication of polynomials is the central bottleneck in the computation of lattice based cryptography. Because public key cryptography is mostly used to establish common secret keys, the focus is on compact area, power and energy budget and to a lesser extent on throughput or latency. While most other work focuses on optimizing number theoretic transform (NTT) based multiplications, in this paper we highly optimize a Toom-Cook based multiplier. We demonstrate that a memory-efficient striding Toom-Cook with lazy interpolation results in a highly compact, low power implementation, which on top enables a very regular memory access scheme. To demonstrate the efficiency, we integrate this multiplier into a post-quantum accelerator for Saber, one of the four NIST finalists.
Algorithmic innovation to reduce active memory, timely clock gating, and a shift-add multiplier have helped to achieve 38% less power than the state-of-the-art PQC core, 4x less memory, a 36.8% reduction in multiplier energy, and a 118x reduction in active power with respect to the state-of-the-art Saber accelerator (not silicon verified). This accelerator consumes 0.158mm$^2$ active area, which is the lowest reported to date despite process disadvantages of the state-of-the-art designs.

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2304.09490 [pdf, other]

doi 10.1109/IOLTS59296.2023.10224890

Neural Network Quantisation for Faster Homomorphic Encryption

Authors: Wouter Legiest, Jan-Pieter D'Anvers, Furkan Turan, Michiel Van Beirendonck, Ingrid Verbauwhede

Abstract: Homomorphic encryption (HE) enables calculating on encrypted data, which makes it possible to perform privacy-preserving neural network inference. One disadvantage of this technique is that it is several orders of magnitude slower than calculation on unencrypted data. Neural networks are commonly trained using floating-point, while most homomorphic encryption libraries calculate on integers, thus requiring a quantisation of the neural network. A straightforward approach would be to quantise to large integer sizes (e.g. 32 bit) to avoid large quantisation errors. In this work, we reduce the integer sizes of the networks, using quantisation-aware training, to allow more efficient computations. For the targeted MNIST architecture proposed by Badawi et al., we reduce the integer sizes by 33% without significant loss of accuracy, while for the CIFAR architecture, we can reduce the integer sizes by 43%. Implementing the resulting networks under the BFV homomorphic encryption scheme using SEAL, we could reduce the execution time of an MNIST neural network by 80% and by 40% for a CIFAR neural network.

Submitted 30 August, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

Comments: 5 pages, 2 figures, 3 tables

arXiv:2304.05306 [pdf, other]

doi 10.1109/TIFS.2023.3326986

Optimizing Linear Correctors: A Tight Output Min-Entropy Bound and Selection Technique

Authors: Miloš Grujić, Ingrid Verbauwhede

Abstract: Post-processing of the raw bits produced by a true random number generator (TRNG) is always necessary when the entropy per bit is insufficient for security applications. In this paper, we derive a tight bound on the output min-entropy of the algorithmic post-processing module based on linear codes, known as linear correctors. Our bound is based on the codes' weight distributions, and we prove that it holds even for the real-world noise sources that produce independent but not identically distributed bits. Additionally, we present a method for identifying the optimal linear corrector for a given input min-entropy rate that maximizes the throughput of the post-processed bits while simultaneously achieving the needed security level. Our findings show that for an output min-entropy rate of $0.999$, the extraction efficiency of the linear correctors with the new bound can be up to $130.56\%$ higher when compared to the old bound, with an average improvement of $41.2\%$ over the entire input min-entropy range. On the other hand, the required min-entropy of the raw bits for the individual correctors can be reduced by up to $61.62\%$.

Submitted 19 October, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: Final version after the review process. Accepted for publication in IEEE Transactions on Information Forensics and Security. Corrected typos

Journal ref: M. Grujić and I. Verbauwhede, "Optimizing Linear Correctors: A Tight Output Min-Entropy Bound and Selection Technique," in IEEE Transactions on Information Forensics and Security, vol. 19, pp. 586-600, 2024

arXiv:2212.05033 [pdf, ps, other]

Mining CryptoNight-Haven on the Varium C1100 Blockchain Accelerator Card

Authors: Lucas Bex, Furkan Turan, Michiel Van Beirendonck, Ingrid Verbauwhede

Abstract: Cryptocurrency mining is an energy-intensive process that presents a prime candidate for hardware acceleration. This work-in-progress presents the first coprocessor design for the ASIC-resistant CryptoNight-Haven Proof of Work (PoW) algorithm. We construct our hardware accelerator as a Xilinx Run Time (XRT) RTL kernel targeting the Xilinx Varium C1100 Blockchain Accelerator Card. The design employs deeply pipelined computation and High Bandwidth Memory (HBM) for the underlying scratchpad data. We aim to compare our accelerator to existing CPU and GPU miners to show increased throughput and energy efficiency of its hash computations.

Submitted 9 December, 2022; originally announced December 2022.

arXiv:2211.13696 [pdf, other]

FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption

Authors: Michiel Van Beirendonck, Jan-Pieter D'Anvers, Furkan Turan, Ingrid Verbauwhede

Abstract: Fully Homomorphic Encryption is a technique that allows computation on encrypted data. It has the potential to change privacy considerations in the cloud, but computational and memory overheads are preventing its adoption. TFHE is a promising Torus-based FHE scheme that relies on bootstrapping, the noise-removal tool invoked after each encrypted logical/arithmetical operation. We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to exploit the inherent noise present in FHE calculations. Instead of double or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations. FPT is built as a streaming processor inspired by traditional streaming DSPs: it instantiates directly cascaded high-throughput computational stages, with minimal control logic and routing networks. We explore throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. Our approach allows 100% utilization of arithmetic units and requires only a small bootstrapping key cache, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35us. This is in stark contrast to the classical CPU approach to FHE bootstrapping acceleration, which is typically constrained by memory and bandwidth.
FPT is implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves two to three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5x higher throughput compared to recent ASIC emulation experiments.

Submitted 18 October, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: ACM CCS 2023

arXiv:2205.14017 [pdf, other]

BASALISC: Programmable Hardware Accelerator for BGV Fully Homomorphic Encryption

Authors: Robin Geelen, Michiel Van Beirendonck, Hilder V. L. Pereira, Brian Huffman, Tynan McAuley, Ben Selfridge, Daniel Wagner, Georgios Dimou, Ingrid Verbauwhede, Frederik Vercauteren, David W. Archer

Abstract: Fully Homomorphic Encryption (FHE) allows for secure computation on encrypted data. Unfortunately, huge memory size, computational cost and bandwidth requirements limit its practicality. We present BASALISC, an architecture family of hardware accelerators that aims to substantially accelerate FHE computations in the cloud. BASALISC is the first to implement the BGV scheme with fully-packed bootstrapping -- the noise removal capability necessary for arbitrary-depth computation. It supports a customized version of bootstrapping that can be instantiated with hardware multipliers optimized for area and power. BASALISC is a three-abstraction-layer RISC architecture, designed for a 1 GHz ASIC implementation and underway toward 150mm$^2$ die tape-out in a 12nm GF process. BASALISC's four-layer memory hierarchy includes a two-dimensional conflict-free inner memory layer that enables 32 Tb/s radix-256 NTT computations without pipeline stalls. Its conflict-resolution permutation hardware is generalized and re-used to compute BGV automorphisms without throughput penalty. BASALISC also has a custom multiply-accumulate unit to accelerate BGV key switching. The BASALISC toolchain comprises a custom compiler and a joint performance and correctness simulator. To evaluate BASALISC, we study its physical realizability, emulate and formally verify its core functional units, and we study its performance on a set of benchmarks. Simulation results show a speedup of more than 5,000 times over HElib -- a popular software FHE library.

Submitted 25 July, 2023; v1 submitted 27 May, 2022; originally announced May 2022.

arXiv:2201.07375 [pdf, other]

A 333.9uW 0.158mm$^2$ Saber Learning with Rounding based Post-Quantum Crypto Accelerator

Authors: Archisman Ghosh, J. M. B. Mera, Angshuman Karmakar, Debayan Das, Santosh Ghosh, Ingrid Verbauwhede, Shreyas Sen

Abstract: The National Institute of Standards and Technology (NIST) is currently running a multi-year-long standardization procedure to select quantum-safe or post-quantum cryptographic schemes to be used in the future. Saber is the only LWR based algorithm to be in the final of Round 3. This work presents a Saber ASIC which provides a 1.37x more power-efficient, 1.75x lower-area, and 4x lower-memory implementation w.r.t. other SoA PQC ASICs. The energy-hungry multiplier block is 1.5x more energy-efficient than SoA.

Submitted 3 July, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

arXiv:1908.03383 [pdf, other]

Advanced profiling for probabilistic Prime+Probe attacks and covert channels in ScatterCache

Authors: Antoon Purnal, Ingrid Verbauwhede

Abstract: Timing channels in cache hierarchies are an important enabler in many microarchitectural attacks. ScatterCache (USENIX 2019) is a protected cache architecture that randomizes the address-to-index mapping with a keyed cryptographic function, aiming to thwart the usage of cache-based timing channels in microarchitectural attacks. In this note, we advance the understanding of the security of ScatterCache by outlining two attacks in the noise-free case, i.e. matching the assumptions in the original analysis. As a first contribution, we present more efficient eviction set profiling, reducing the required number of observable victim accesses (and hence profiling runtime) by several orders of magnitude. For instance, to construct a reliable eviction set in an 8-way set associative cache with 11 index bits, we relax victim access requirements from approximately $2^{25}$ to less than $2^{10}$. As a second contribution, we demonstrate covert channel profiling and transmission in probabilistic caches like ScatterCache. By exploiting arbitrary collisions instead of targeted ones, our approach significantly outperforms known covert channels (e.g. full-cache eviction).

Submitted 9 August, 2019; originally announced August 2019.

arXiv:1706.07257 [pdf]

A survey of Hardware-based Control Flow Integrity (CFI)

Authors: Ruan de Clercq, Ingrid Verbauwhede

Abstract: CFI is a computer security technique that detects runtime attacks by monitoring a program's branching behavior. This work presents a detailed analysis of the security policies enforced by 21 recent hardware-based CFI architectures. The goal is to evaluate the security, limitations, hardware cost, performance, and practicality of using these policies. We show that many architectures are not suitable for widespread adoption, since they have practical issues, such as relying on an accurate control flow model (which is difficult to obtain), or they implement policies which provide only limited security.

Submitted 31 July, 2017; v1 submitted 22 June, 2017; originally announced June 2017.

arXiv:0710.4806 [pdf]

A VLSI Design Flow for Secure Side-Channel Attack Resistant ICs

Authors: Kris Tiri, Ingrid Verbauwhede

Abstract: This paper presents a digital VLSI design flow to create secure, side-channel attack (SCA) resistant integrated circuits. The design flow starts from a normal design in a hardware description language such as VHDL or Verilog and provides a direct path to a SCA resistant layout. Instead of a full custom layout or an iterative design process with extensive simulations, a few key modifications are incorporated in a regular synchronous CMOS standard cell design flow. We discuss the basis for side-channel attack resistance and adjust the library databases and constraints files of the synthesis and place & route procedures accordingly. Experimental results show that a DPA attack on a regular single ended CMOS standard cell implementation of a module of the DES algorithm discloses the secret key after 200 measurements. The same attack on a secure version still does not disclose the secret key after more than 2000 measurements.

Submitted 25 October, 2007; originally announced October 2007.

Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

Journal ref: In Design, Automation and Test in Europe | Designers' Forum - DATE'05, Munich, Germany (2005)

arXiv:0710.4756 [pdf]

Design Method for Constant Power Consumption of Differential Logic Circuits

Authors: Kris Tiri, Ingrid Verbauwhede

Abstract: Side channel attacks are a major security concern for smart cards and other embedded devices. They analyze the variations in the power consumption to find the secret key of the encryption algorithm implemented within the security IC. To address this issue, logic gates that have a constant power dissipation, independent of the input signals, are used in security ICs. This paper presents a design methodology to create fully connected differential pull down networks. Fully connected differential pull down networks are transistor networks that for any complementary input combination connect all the internal nodes of the network to one of the external nodes of the network. They are memoryless and for that reason have a constant load capacitance and power consumption. This type of network is used in specialized logic gates to guarantee a constant contribution of the internal nodes to the total power consumption of the logic gate.

Submitted 25 October, 2007; originally announced October 2007.

Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

Journal ref: In Design, Automation and Test in Europe - DATE'05, Munich, Germany (2005)

arXiv:0710.4646 [pdf]

Fast Dynamic Memory Integration in Co-Simulation Frameworks for Multiprocessor System on-Chip

Authors: O. Villa, P. Schaumont, I. Verbauwhede, M. Monchiero, G. Palermo

Abstract: This paper proposes a technique to integrate and simulate a dynamic memory in a multiprocessor framework based on C/C++/SystemC. Using the host machine's memory management capabilities, dynamic data processing is supported without compromising speed and accuracy of the simulation. A first prototype in a shared memory context is presented.

Submitted 25 October, 2007; originally announced October 2007.

Comments: Submitted on behalf of EDAA (http://www.edaa.com/)

Journal ref: In Design, Automation and Test in Europe - DATE'05, Munich, Germany (2005)

Showing 1–15 of 15 results for author: Verbauwhede, I