-
REED: Chiplet-Based Accelerator for Fully Homomorphic Encryption
Authors:
Aikata Aikata,
Ahmet Can Mert,
Sunmin Kwon,
Maxim Deryabin,
Sujoy Sinha Roy
Abstract:
Fully Homomorphic Encryption (FHE) enables privacy-preserving computation and has many applications. However, its practical implementation faces massive computation and memory overheads. To address this bottleneck, several Application-Specific Integrated Circuit (ASIC) FHE accelerators have been proposed. All these prior works put every component needed for FHE onto one chip (monolithic), hence offering high performance. However, they suffer from practical problems associated with large-scale chip design, such as inflexibility, low yield, and high manufacturing cost.
In this paper, we present 'REED', the first multi-chiplet-based FHE accelerator, which overcomes the limitations of prior monolithic designs. To exploit the advantages of multi-chiplet structures while matching the performance of larger monolithic systems, we propose and implement several novel strategies in the context of FHE. These include a scalable chiplet design approach, an effective framework for workload distribution, a custom inter-chiplet communication strategy, and advanced pipelined Number Theoretic Transform and automorphism designs to enhance performance.
Experimental results demonstrate that the REED 2.5D microprocessor occupies 96.7 mm$^2$ of chip area and consumes 49.4 W of average power in 7nm technology. It achieves a speedup of up to 2,991x over a CPU (24-core 2xIntel X5690) and offers 1.9x better performance, along with a 50% reduction in development costs, compared to state-of-the-art ASIC FHE accelerators. Furthermore, our work presents the first benchmark of encrypted deep neural network (DNN) training. Overall, the REED architecture offers an effective solution for accelerating FHE, advancing the practicality and deployability of FHE in real-world applications.
Submitted 1 May, 2024; v1 submitted 5 August, 2023;
originally announced August 2023.
-
A Unified Cryptoprocessor for Lattice-based Signature and Key-exchange
Authors:
Aikata Aikata,
Ahmet Can Mert,
David Jacquemin,
Amitabh Das,
Donald Matthews,
Santosh Ghosh,
Sujoy Sinha Roy
Abstract:
We propose design methodologies for building a compact, unified, and programmable cryptoprocessor architecture that computes post-quantum key agreement and digital signatures. Synergies between the two types of cryptographic primitives are used to make the cryptoprocessor compact. As a case study, the cryptoprocessor architecture has been optimized for the signature scheme 'CRYSTALS-Dilithium' and the key encapsulation mechanism (KEM) 'Saber', both finalists in NIST's post-quantum cryptography standardization project. The programmable cryptoprocessor executes key generation, encapsulation, decapsulation, signature generation, and signature verification for all the security levels of Dilithium and Saber. On a Xilinx Ultrascale+ FPGA, the proposed cryptoprocessor consumes 18,406 LUTs, 9,323 FFs, 4 DSPs, and 24 BRAMs. It achieves a 200 MHz clock frequency and finishes CCA-secure key-generation/encapsulation/decapsulation operations for LightSaber in 29.6/40.4/58.3 μs; for Saber in 54.9/69.7/94.9 μs; and for FireSaber in 87.6/108.0/139.4 μs, respectively. It finishes key-generation/sign/verify operations for Dilithium-2 in 70.9/151.6/75.2 μs; for Dilithium-3 in 114.7/237/127.6 μs; and for Dilithium-5 in 194.2/342.1/228.9 μs, respectively, in the best-case scenario. With the UMC 65nm ASIC library, the latency improves by a factor of two owing to a 2x increase in clock frequency.
Submitted 13 October, 2022;
originally announced October 2022.
-
Medha: Microcoded Hardware Accelerator for computing on Encrypted Data
Authors:
Ahmet Can Mert,
Aikata,
Sunmin Kwon,
Youngsam Shin,
Donghoon Yoo,
Yongwoo Lee,
Sujoy Sinha Roy
Abstract:
Homomorphic encryption (HE) enables computation on encrypted data, and hence has great potential for privacy-preserving outsourcing of computations to the cloud. Hardware acceleration of HE is crucial because software implementations are very slow. In this paper, we present design methodologies for building a programmable hardware accelerator that speeds up cloud-side homomorphic evaluations on encrypted data. First, we propose a divide-and-conquer technique that enables homomorphic evaluations in a large polynomial ring $R_{Q,2N}$ to use a hardware accelerator that has been built for the smaller ring $R_{Q,N}$. The technique makes it possible to use a single hardware accelerator flexibly to support several HE parameter sets. Next, we present several architectural design methods that we use to realize the flexible, instruction-set accelerator architecture, which we call 'Medha'. At every level of the implementation hierarchy, we explore possibilities for parallel processing. Starting from hardware-friendly parallel algorithms for the basic building blocks, we gradually build heavily parallel RNS polynomial arithmetic units. Many of these parallel units are then interconnected so that their interconnections require the minimum number of nets, making the overall architecture placement-friendly on the platform. For Medha, we take a memory-conservative design approach and eliminate all off-chip memory access during homomorphic evaluations. Finally, we implement Medha on a Xilinx Alveo U250 FPGA and measure the timing performance of the microcoded homomorphic addition, multiplication, key-switching, and rescaling for the leveled HE scheme RNS-HEAAN at a 200 MHz clock frequency. For two large parameter sets, Medha achieves speedups of up to 68x and 78x, respectively, compared to the highly optimized software implementation Microsoft SEAL running at 2.3 GHz.
Submitted 12 October, 2022; v1 submitted 11 October, 2022;
originally announced October 2022.
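The divide-and-conquer idea in the Medha abstract (evaluating in $R_{Q,2N}$ with hardware built for $R_{Q,N}$) can be illustrated with a minimal sketch. It assumes the rings are of the negacyclic form Z_q[x]/(x^n + 1), which the abstract does not state, and it uses a textbook even/odd coefficient split that is not necessarily the authors' exact technique. The function names are hypothetical, and the schoolbook `negacyclic_mul` merely stands in for the size-N accelerator.

```python
def negacyclic_mul(a, b, q):
    """Schoolbook product in Z_q[y]/(y^n + 1), n = len(a).
    Stands in for the (hypothetical) size-N hardware multiplier."""
    n = len(a)
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                out[k] = (out[k] + ai * bj) % q
            else:
                # y^n = -1, so wrapped terms are subtracted
                out[k - n] = (out[k - n] - ai * bj) % q
    return out


def mul_by_y(p, q):
    """Multiply by y in Z_q[y]/(y^n + 1): shift by one, negate the wrapped term."""
    return [(-p[-1]) % q] + p[:-1]


def mul_2n_via_n(a, b, q):
    """Product in Z_q[x]/(x^(2N) + 1) using only size-N products.

    Write a(x) = a0(y) + x*a1(y) with y = x^2 (even/odd coefficients).
    Since x^(2N) + 1 = y^N + 1, each half lives in the size-N ring and
    a*b = (a0*b0 + y*a1*b1) + x*(a0*b1 + a1*b0).
    """
    a0, a1 = a[0::2], a[1::2]
    b0, b1 = b[0::2], b[1::2]
    p00 = negacyclic_mul(a0, b0, q)
    p11 = negacyclic_mul(a1, b1, q)
    p01 = negacyclic_mul(a0, b1, q)
    p10 = negacyclic_mul(a1, b0, q)
    even = [(u + v) % q for u, v in zip(p00, mul_by_y(p11, q))]
    odd = [(u + v) % q for u, v in zip(p01, p10)]
    out = [0] * len(a)
    out[0::2] = even  # coefficients of even powers of x
    out[1::2] = odd   # coefficients of odd powers of x
    return out


if __name__ == "__main__":
    a, b, q = [1, 2, 3, 4], [5, 6, 7, 8], 17
    # The split result should match a direct size-2N product.
    assert mul_2n_via_n(a, b, q) == negacyclic_mul(a, b, q)
```

The four size-N products are independent of one another, so under these assumptions they could be issued to the smaller unit sequentially or in parallel.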