PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers

Mohammad Erfan Sadeghi, Arash Fayyazi, Seyedarmin Azizi, and Massoud Pedram
University of Southern California, Los Angeles, California, USA 90089
(2024)
Abstract.

The deployment of Vision Transformers (ViTs) on hardware platforms, especially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardware implementation due to their complex mathematical operations and the inherent resource and architectural limitations of FPGAs. PEANO-ViT offers a novel approach to streamlining the implementation of the layer normalization layer by introducing a division-free technique that simultaneously approximates the division and square root function. Additionally, PEANO-ViT provides a multi-scale division strategy to eliminate division operations in the softmax layer, aided by a Padé-based approximation for the exponential function. Finally, PEANO-ViT introduces a piece-wise linear approximation for the GELU function, carefully designed to bypass the computationally intensive operations associated with GELU. In our comprehensive evaluations, PEANO-ViT exhibits minimal accuracy degradation ($\leq 0.5\%$ for DeiT-B) while significantly enhancing power efficiency, achieving improvements of $1.91\times$, $1.39\times$, and $8.01\times$ for layer normalization, softmax, and GELU, respectively. This improvement is achieved through substantial reductions in DSP, LUT, and register counts for these non-linear operations. Consequently, PEANO-ViT enables efficient deployment of Vision Transformers on resource- and power-constrained FPGA platforms.

Vision Transformers, FPGA Implementation, Deep Learning Efficiency, Hardware Acceleration
Conference: Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED ’24), August 5–7, 2024, Newport Beach, CA, USA
DOI: 10.1145/3665314.3670843
ISBN: 979-8-4007-0688-2/24/08
CCS concepts: Hardware, High-level and register-transfer level synthesis; Computing methodologies, Computer vision

1. Introduction

The landscape of computer vision has been fundamentally transformed with the advent of deep learning architectures, among which Vision Transformers (ViTs) (Dosovitskiy et al., 2021; Touvron et al., 2021; Liu et al., 2021b) have emerged as a particularly promising approach. Unlike traditional CNNs that use local receptive fields, ViTs use self-attention mechanisms to capture global dependencies in an image, enabling a more comprehensive understanding of visual data. This has positioned ViTs at the forefront of research, showing state-of-the-art performance across various computer vision tasks.

ViTs rely on a series of identical encoder blocks to progressively extract complex features from an image. These encoder blocks consist of two principal components: Multi-headed Attention (MHA) and Feed-Forward Network (FFN), each prefaced with a layer normalization block. Embedded within MHA and FFN are linear layers, GELU, and softmax, integrated via two residual connections that bookend the normalization stages. The output of the final encoder block goes through a classifier to obtain the class predictions.

Despite their exceptional performance, ViTs face significant challenges for practical deployment due to their extensive parameter count and considerable computational demands. A wide range of methods has been explored to improve the efficiency of ViTs, including approaches like quantization (Liu et al., 2021a), pruning (Yu et al., 2022), and low-rank approximations (Azizi et al., 2024). However, the deployment of ViTs in practical applications, especially on hardware platforms such as Field-Programmable Gate Arrays (FPGAs), presents fundamental challenges. Among these, the non-linear layers—layer normalization, softmax, and GELU—integral to the architecture of ViTs, stand out. While crucial for the network’s ability to model complex patterns, these functions are computationally intensive and thus present a critical challenge for the efficient implementation on FPGAs.

Our research delivers two key contributions. Firstly, we introduce PEANO-ViT, a novel approach that utilizes hardware-optimized approximation techniques for the non-linear functions within ViTs. Our approach in PEANO-ViT offers a comprehensive solution to the challenges posed by implementing key functions in ViTs on FPGA platforms. By leveraging innovative techniques such as the Padé-based approximation for the exponential function and incorporating bit manipulation operations for efficient division in the softmax layer, we strive for a well-balanced and resource-efficient implementation that prioritizes performance and resource conservation. The layer normalization implementation effectively tackles computational challenges by approximating the reciprocal of the square root, $\frac{1}{\sqrt{x}}$, in a novel manner. Furthermore, our adoption of a piece-wise linear approximation for GELU not only minimizes resource usage but also closely preserves the original function’s behavior. Secondly, we demonstrate through comprehensive experiments that PEANO-ViT enables the efficient execution of ViTs on FPGAs, with minimal impact on accuracy and significant improvements in computational efficiency and power consumption.

2. Related Work

Transformers (Vaswani et al., 2017), originally developed for tackling long sequences in natural language processing tasks, served as the inspiration behind ViT (Dosovitskiy et al., 2021) for computer vision applications. ViTs achieve impressive results by processing images as sequences of tokens and leveraging the power of self-attention. However, while crucial for performance, the core non-linear functions in ViTs (softmax, GELU, and layer normalization) are computationally expensive and hinder efficient hardware implementation. Several studies have explored hardware-efficient strategies for these layers, presenting various approximation techniques that balance approximation accuracy with computational cost. Their characteristics in comparison to PEANO-ViT are summarized in Table 1. The calculations for basic layer normalization, softmax, and GELU are given in equations (1-3), respectively. In equation (1), $\gamma$ and $\beta$ are learnable parameters, while $\mu$ and $\sigma$ represent the mean and standard deviation of the input data of the layer normalization function.

(1) $\text{LayerNorm}(x_i) = \frac{x_i - \mu}{\sigma} * \gamma + \beta$
(2) $\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum e^{x_i}}$
(3) $\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^{3}\right)\right]\right)$
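As a point of reference for the approximations discussed later, equations (1-3) can be written as a small floating-point sketch. This is only an illustrative baseline, not the hardware implementation: the function names are hypothetical, the variance is assumed to be the population variance, no epsilon term is added to the denominator, and the softmax subtracts the maximum input for numerical stability.

```python
import math

def layer_norm(xs, gamma=1.0, beta=0.0):
    """Equation (1): normalize by mean and standard deviation, then scale and shift."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # population variance (assumption)
    sigma = math.sqrt(var)
    return [(x - mu) / sigma * gamma + beta for x in xs]

def softmax(xs):
    """Equation (2), with the maximum subtracted first for numerical stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gelu_tanh(x):
    """Equation (3): tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

The square root, the two divisions, the exponential, and the tanh in this sketch are the operations that the remainder of the paper replaces with shifts, lookups, and low-degree arithmetic.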
Table 1. Comparison of state-of-the-art methods for implementation of non-linear layers
Approach | Layer normalization approximation | Softmax approximation | GELU approximation | All division-free approximations | Accuracy- and resource-aware flexible approximations
Softermax (Stevens et al., 2021)
Koca et al. (Koca et al., 2023)
Peltekis et al. (Peltekis et al., 2024)
SOLE (Wang et al., 2023)
Li et al. (Li et al., 2023)
LTrans-OPU (Bai et al., 2023)
PEANO-ViT (Ours)

2.1. Softmax Implementations

The implementation of the softmax layer has emerged as a focal point of research, with numerous studies dedicated to optimizing its efficiency through various approximation techniques. The main challenges for an efficient implementation of softmax on hardware platforms arise from the non-linear function $e^{x}$ and the final division operation for normalizing the output values. Previous research efforts, such as (Stevens et al., 2021), targeted the efficient calculation of the exponential function but were hindered by the costly division operation. In contrast, studies by (Koca et al., 2023), (Wang et al., 2023), and (Li et al., 2023) adopted bit manipulation techniques to simplify the exponential function approximation and eliminate the need for division. Although these methods are beneficial for reducing computational demands and are well-suited for hardware implementation, they still have a high computational complexity due to their inherently iterative nature, causing increased inference latency.

2.2. Layer Normalization Implementations

For hardware implementation of layer normalization, significant hurdles include the efficient approximation of the square root function and managing division operations. The approach introduced in (Wang et al., 2023) tackles the division operation issue but continues to employ the exact yet resource-intensive formula of square root, resulting in lower throughput.

2.3. GELU Implementations

Beyond layer normalization and softmax, the GELU function’s approximation also poses a significant challenge in the hardware deployment of ViTs. This is due to its intricate non-linear nature, which necessitates the execution of the $\tanh(x)$ function alongside polynomial calculations. Authors of (Li et al., 2023) have explored the approximation of the GELU function by simplifying the non-linear $2^{x}$ function using bit manipulation operations. Additionally, (Peltekis et al., 2024) has presented an innovative method that leverages existing softmax hardware to facilitate GELU computations. While these approaches are designed to be hardware-efficient and minimize resource consumption, the computational latency remains a concern. This is due to the iterative nature of some of the bit manipulation operations in (Li et al., 2023), and the use of non-optimized hardware for GELU in (Peltekis et al., 2024).

3. Methodology

In this section, we describe the techniques utilized to approximate the layer normalization, softmax, and GELU functions. Our emphasis was on developing methods that avoid divisions and ensure compatibility with hardware implementations while also aiming to preserve the accuracy of the model as much as possible.

3.1. Layer Normalization

As described in subsection 2.2, the main challenges of implementing layer normalization on hardware platforms such as FPGAs are the non-linear square root function and the costly division operation. Inspired by SOLE (Wang et al., 2023), we propose a method to directly approximate $\frac{1}{\sqrt{X}}$. We start with the following identities:

(4) $\frac{1}{\sqrt{X}} = 2^{\log_{2}\frac{1}{\sqrt{X}}}, \quad \log_{2}\frac{1}{\sqrt{X}} = \frac{-1}{2}\log_{2}X$

Based on (Wang et al., 2023), we use equations (5-6) to approximate $\log_{2}X$, in which $k_{x}$ is the position of the leading ’1’ bit of $X$ and $x \in [0, 1)$:

(5) $X = \sum_{i=0}^{n-1} 2^{i} b_{i} = 2^{k_{x}} + \sum_{i=0}^{k_{x}-1} 2^{i} b_{i} = 2^{k_{x}}(1 + x)$
(6) $\log_{2}X \approx k_{x} + x$
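Equations (5-6) can be illustrated with a short sketch for a positive integer $X$. The function name `log2_approx` is hypothetical, and the fraction $x$ is computed in floating point here only for readability; in fixed-point hardware it would simply be the bits below the leading one.

```python
import math

def log2_approx(X):
    """Approximate log2(X) for a positive integer X as k_x + x (equation (6))."""
    if X <= 0:
        raise ValueError("X must be a positive integer")
    k_x = X.bit_length() - 1        # position of the leading '1' bit
    low_bits = X - (1 << k_x)       # the bits below the leading one
    x = low_bits / (1 << k_x)       # fraction in [0, 1), so X = 2**k_x * (1 + x)
    return k_x + x

# Example: 12 = 2**3 * (1 + 0.5), so the approximation is 3.5,
# while the exact value is math.log2(12), roughly 3.585.
```

Because $\log_{2}(1 + x) \geq x$ on $[0, 1)$, this estimate never exceeds the true logarithm, and the gap stays below about 0.086.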

Therefore, we can have the following approximation:

(7) $\frac{1}{\sqrt{X}} \approx 2^{\frac{-(k_{x}+x)}{2}}$

Calculating the $2^{\frac{-(k_{x}+x)}{2}}$ term is the only step remaining. We note that $2^{\alpha} = 2^{u} * 2^{v}$, in which $u$ is an integer and $v \in [0, 1)$. To avoid calculating $2^{v}$, we keep the top $m$ bits of $v$’s binary representation as $v \approx \tilde{v} = (0.v_{-1} \dots v_{-m})_{2}$ and pre-store $2^{(0.0\ldots0)_{2}}$ up to $2^{(0.1\ldots1)_{2}}$. Since $u$ is an integer, $2^{u}$ can be implemented using the shift operation. Thus, the approximation of $\frac{1}{\sqrt{X}}$ can be obtained from the two equations below:

(8) $2^{\frac{-(k_{x}+x)}{2}} = 2^{u} \cdot 2^{v}$
(9) $\frac{1}{\sqrt{X}} \approx 2^{\tilde{v}} << u$

Figure 1(b) compares $\frac{1}{\sqrt{X}}$ with our approximation, and the overall layer normalization method is described in Algorithm 1. Using these approximations, we simultaneously tackle the two problems of efficiently implementing the square root function and approximating the division operation. It is important to highlight that $m$, an adjustable integer parameter, enables a trade-off between the precision of the approximation and the on-chip memory required for storing the $2^{\tilde{v}}$ values. Increasing $m$ improves the approximation accuracy at the cost of more on-chip memory. This flexibility will be discussed in detail in Section 4.3.

Algorithm 1 PEANO Layer Normalization
Input: $x_{1}, \ldots, x_{n}$, $\gamma$, $\beta$, $fracPow2[2^{m}] = \{2^{(0.0\ldots0)_{2}}, \ldots, 2^{(0.1\ldots1)_{2}}\}$
Output: $y_{1}, \ldots, y_{n}$
$Avg = \frac{1}{n}\sum_{i=1}^{n} x_{i}$ //average of inputs
$AvgSQ = \frac{1}{n}\sum_{i=1}^{n} x_{i}^{2}$ //average of inputs squared
$Var = AvgSQ - Avg^{2}$ //variance of inputs
$k_{Var} = LeadingOne(Var)$ //leading ’1’ bit of variance
$x_{Var} = Var[k_{Var}-1:0]$ //contains the bits after $k_{Var}$
$log_{2}Approx = -(k_{Var} + x_{Var}) >> 1$
$u = \lfloor log_{2}Approx \rfloor$
$v = log_{2}Approx - u$ //fractional part, $v \in [0, 1)$
$\tilde{v} = fracBits(v, m)$ //$\tilde{v}$ keeps the top $m$ fractional bits in $v$
$recipSqrt = fracPow2[\tilde{v}] << u$ //approximation of $\frac{1}{\sqrt{Var}}$
for $i = 1$ to $n$ do
     $y_{i} = (x_{i} - Avg) * recipSqrt * \gamma + \beta$
end for
return $y_{1}, y_{2}, \ldots, y_{n}$
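The steps of Algorithm 1 can be emulated in floating point as a rough sketch. Everything below is an assumption made for illustration rather than the authors' fixed-point design: the names `make_frac_pow2`, `recip_sqrt_approx`, and `peano_layer_norm` are hypothetical, the leading-one position and fraction are extracted with `math.frexp`, and the shift by $u$ is modeled with `math.ldexp`. The fractional part is taken as $v = log_{2}Approx - u$ so that $v \in [0, 1)$, as required by the text.

```python
import math

def make_frac_pow2(m):
    """Lookup table of 2**v for the 2**m values v = (0.b1...bm) in binary."""
    return [2.0 ** (i / (1 << m)) for i in range(1 << m)]

def recip_sqrt_approx(var, m, table):
    """Approximate 1/sqrt(var) following equations (4)-(9)."""
    if var <= 0.0:
        raise ValueError("var must be positive")
    mant, exp = math.frexp(var)          # var = mant * 2**exp, mant in [0.5, 1)
    k = exp - 1                          # position of the leading one
    x = 2.0 * mant - 1.0                 # fraction in [0, 1): var = 2**k * (1 + x)
    log2_approx = -(k + x) / 2.0         # approximates log2(1/sqrt(var))
    u = math.floor(log2_approx)          # integer part, handled by a shift
    v = log2_approx - u                  # fractional part in [0, 1)
    idx = min(int(v * (1 << m)), len(table) - 1)  # top m fractional bits of v
    return math.ldexp(table[idx], u)     # table[idx] * 2**u

def peano_layer_norm(xs, gamma, beta, m=4):
    """Layer normalization using the approximate reciprocal square root."""
    n = len(xs)
    avg = sum(xs) / n                    # 1/n is a constant multiply in hardware
    avg_sq = sum(x * x for x in xs) / n
    var = avg_sq - avg * avg
    r = recip_sqrt_approx(var, m, make_frac_pow2(m))
    return [(x - avg) * r * gamma + beta for x in xs]
```

For example, `recip_sqrt_approx(4.0, 4, make_frac_pow2(4))` gives exactly 0.5, since $k = 2$, $x = 0$, $u = -1$, and $v = 0$. With $m = 4$ the table holds 16 entries; raising $m$ doubles the table per extra bit while shrinking the truncation error in $\tilde{v}$.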

3.2. Softmax

Our method for softmax approximation includes two steps. First, we introduce a Padé-based approximation for the exponential function. In the second step, we eliminate the division operations by proposing a multi-scale reciprocal approximation (MSR-approx) method. The Padé approximation $Pade_{[m,n]}(x) = \frac{a_{0} + a_{1}x + \ldots + a_{m}x^{m}}{b_{0} + b_{1}x + \ldots + b_{n}x^{n}}$ of a function $f(x)$ is the ratio of two polynomials. It often provides a better approximation of a nonlinear function than a pure polynomial approximation of comparable degree. For approximating the $e^{x}$ term, we have set $m = n = 2$ to get a Padé approximation as follows:

(10) $e^{x} \approx \frac{12 + 6x + x^{2}}{12 - 6x + x^{2}}$

To compute the $Pade_{[2,2]}(x)$ approximation of $e^{x}$, we only need to compute $x^{2} = x \cdot x$ and $6x = (x << 2) + (x << 1)$, thanks to the numerator and denominator having similar functional forms. Figure 1(a) illustrates the Padé-based approximation compared to $e^{x}$. As can be seen, the proposed approximation is very accurate for $x \in [-3, 2]$. This observation motivated us first to add 2 to all inputs (after subtracting the maximum value) and then to set $e^{x}$ to $0$ for values less than $-3$ after the first step’s calculations. Our final approximation of the exponential function is thus as follows:

(11) $PEANOexp(\tilde{x}) = \begin{cases} 0 & \text{if } \tilde{x} < -3 \\ \frac{12 + 6\tilde{x} + \tilde{x}^{2}}{12 - 6\tilde{x} + \tilde{x}^{2}} & \text{if } \tilde{x} \geq -3 \end{cases}$

Here, $\tilde{x} = x - max(x_{i}) + 2$. The above approximation adds another division operation to the main calculation of softmax. The first division is for the computation of $PEANOexp(\tilde{x})$, while the second division is needed for the softmax’s output normalization. Since $\tilde{x} \in [-3, 2]$ whenever the Padé branch is used, the values of the denominator of $PEANOexp(\tilde{x})$ lie in the interval $[4, 39]$. This motivated us to pre-store a set of $\frac{1}{x}$ values and subsequently use them to approximate the reciprocal function. However, unlike the denominator of $PEANOexp(\tilde{x})$, the denominator of the second division has a huge range of values. Therefore, pre-storing values to approximate the second division is not feasible (unless a very large lookup table is used, which would result in high memory usage).
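Equations (10) and (11), together with the input shift $\tilde{x} = x - max(x_{i}) + 2$, can be sketched in floating point as follows. The function names are hypothetical, and the division is kept exact here; replacing it with a lookup is a separate step.

```python
import math

def pade22_exp(x):
    """[2/2] Pade approximant of e**x from equation (10)."""
    x2 = x * x
    # In fixed-point hardware 6x is (x << 2) + (x << 1); the parentheses matter
    # in code, because '+' binds tighter than '<<' in most languages.
    six_x = 6.0 * x
    return (12.0 + six_x + x2) / (12.0 - six_x + x2)

def peano_exp(x_tilde):
    """Equation (11): zero below -3, Pade approximant otherwise."""
    if x_tilde < -3.0:
        return 0.0
    return pade22_exp(x_tilde)

def shifted_inputs(xs):
    """x_tilde = x - max(x) + 2, so every shifted input is at most 2."""
    m = max(xs)
    return [x - m + 2.0 for x in xs]
```

As a quick check of the stated denominator range, the denominator $12 - 6\tilde{x} + \tilde{x}^{2}$ equals 39 at $\tilde{x} = -3$ and 4 at $\tilde{x} = 2$, and it decreases monotonically in between. At the upper end, `pade22_exp(2.0)` returns 7.0, compared with $e^{2} \approx 7.389$.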

To avoid a very large lookup table for the second division, we propose a multi-scale reciprocal approximation (MSR-approx) scheme for both division operations in the softmax. First, we replace $X$ (the denominator) with $\tilde{X}$ using the equation below:

(12) $\tilde{X} = Scale \cdot \lfloor \frac{X}{Scale} \rfloor$

And the reciprocal function approximation is described as,

(13) $\frac{1}{X} \approx \frac{1}{\tilde{X}} = \frac{1}{Scale} \cdot \frac{1}{\lfloor \frac{X}{Scale} \rfloor}$

Next, we force $Scale = 2^{\alpha}$ to be an integer power of 2 so that multiplication by $\frac{1}{Scale}$ can be implemented as a right shift by $\alpha$. This constraint also simplifies the calculation of $\lfloor \frac{X}{Scale} \rfloor$, since it simply means dropping the $\alpha$ rightmost bits of $X$. The only remaining requirement is to pre-store the reciprocals of the possible $\lfloor \frac{X}{Scale} \rfloor$ values, which is still problematic because the range of $X$ can be extremely wide for the second division operation. This problem arises from assuming a fixed $\alpha$ for all values of $X$; using a dynamic value of $\alpha$ solves the problem of $X$’s large range, as described in Algorithm 2.

Algorithm 2 shows the multi-scale approximation of the reciprocal function (MSR-approx), which uses an adjustable integer threshold $\alpha^{*}$ and pre-stored values $\{\frac{1}{1},\ldots,\frac{1}{2^{\alpha^{*}+1}-1}\}$. MSR-approx maps every value of $X$ into the interval $[1,2^{\alpha^{*}+1}-1]$ by defining a flexible $Scale$ value, which resolves the problem of the dynamic range of $X$. For instance, if $\alpha^{*}=4$, then $\lfloor\frac{X}{Scale}\rfloor\in\{1,\dots,31\}$ for $X\in[1,31]$, and $\lfloor\frac{X}{Scale}\rfloor\in\{16,\dots,31\}$ for all other values of $X$. Hence, we only need to pre-store $\{\frac{1}{1},\ldots,\frac{1}{31}\}$. Figure 1(c) compares our MSR-approx method with the original reciprocal function for $\alpha^{*}=4$. Choosing $\alpha^{*}$ is a trade-off between the accuracy of MSR-approx and the memory required for the pre-stored values (see Section 4.3): a larger $\alpha^{*}$ yields a more accurate approximation of the reciprocal function but requires more memory for the pre-stored values. The softmax using the MSR-approx scheme is presented in Algorithm 3.
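The multi-scale reciprocal described above can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware implementation: `leading_one` is taken to return the zero-based position of the most significant set bit, the table is indexed so that entry $k$ holds $\frac{1}{k}$, and the final right shift of the stored value is modelled as a floating-point division by $2^{\alpha}$. The function and variable names are illustrative.

```python
def leading_one(x: int) -> int:
    """Zero-based position of the most significant set bit (floor(log2(x)) for x >= 1)."""
    return x.bit_length() - 1


def build_recip_table(alpha_star: int) -> list[float]:
    """Pre-stored reciprocals 1/1 ... 1/(2^(alpha*+1) - 1); index 0 is unused padding."""
    size = 2 ** (alpha_star + 1)
    return [0.0] + [1.0 / k for k in range(1, size)]


def msr_approx(x: int, alpha_star: int, table: list[float]) -> float:
    """Look up 1/(x >> alpha) in the table, then rescale the result by 2^-alpha."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    # alpha = 0 when x already fits in the table, otherwise the excess bit count.
    alpha = max(0, leading_one(x) - alpha_star)
    # Dividing by 2**alpha stands in for the right shift of the stored value.
    return table[x >> alpha] / (1 << alpha)


ALPHA_STAR = 4
TABLE = build_recip_table(ALPHA_STAR)
for x in (7, 31, 32, 59, 1000):
    print(x, msr_approx(x, ALPHA_STAR, TABLE), 1.0 / x)
```

Because only the top $\alpha^{*}+1$ bits of $X$ select the table entry, the relative error of this model stays below $\frac{1}{2^{\alpha^{*}}}$ for any positive integer input.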

An alternative approach for improving the accuracy of the multi-scale division is to use linear interpolation between pre-stored points instead of directly using one of these points. For instance, if $X=59$ and $\alpha^{*}=4$, then $Scale=2$, so the basic MSR-approx method approximates $\frac{1}{59}$ using the stored value $\frac{1}{\lfloor\frac{59}{2}\rfloor}=\frac{1}{29}$ (rescaled by $Scale$ as in Algorithm 2). Instead, we can interpolate linearly between $\frac{1}{29}$ and $\frac{1}{30}$ to obtain a more accurate approximation of $\frac{1}{59}$. The MSR approximation enhanced with linear interpolation (called LMSR-approx) attains superior accuracy at the expense of slightly increased resource consumption and additional computational cycles, illustrating a clear trade-off between accuracy and resource efficiency.
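A possible reading of the interpolation step is sketched below in Python. The interpolation weight is assumed to be the fraction formed by the bits that the shift discards, and the table is assumed to hold one extra entry, $\frac{1}{2^{\alpha^{*}+1}}$, so that the last segment has a right-hand endpoint. Both choices are assumptions of this sketch rather than details given in the text, and the names are illustrative.

```python
def build_recip_table(alpha_star: int) -> list[float]:
    """Reciprocals 1/1 ... 1/2^(alpha*+1); index 0 is unused padding.

    The final entry is one more than the text pre-stores (assumption of this sketch).
    """
    size = 2 ** (alpha_star + 1)
    return [0.0] + [1.0 / k for k in range(1, size + 1)]


def lmsr_approx(x: int, alpha_star: int, table: list[float]) -> float:
    """MSR-approx with linear interpolation between neighbouring table entries."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    alpha = max(0, (x.bit_length() - 1) - alpha_star)
    k = x >> alpha
    # The bits dropped by the shift locate x between table points k and k + 1.
    frac = (x - (k << alpha)) / (1 << alpha)
    interpolated = table[k] + frac * (table[k + 1] - table[k])
    # Rescale by 2^-alpha, as in the basic scheme.
    return interpolated / (1 << alpha)


TABLE = build_recip_table(4)
basic = TABLE[59 >> 1] / 2           # basic MSR-approx value for X = 59
refined = lmsr_approx(59, 4, TABLE)  # interpolated value for X = 59
print(basic, refined, 1.0 / 59)
```

For $X=59$ the interpolated value lies midway between the two stored points, which brings it noticeably closer to $\frac{1}{59}$ than the basic lookup.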

Figure 1. Comparison of standard functions with our approximations: (a) exponential function, (b) reciprocal square root function, (c) reciprocal function, (d) GELU function.

3.3. GELU

PEANO-ViT uses a piece-wise linear approach to approximate the Gaussian Error Linear Unit (GELU). Unlike ViT’s other non-linear functions, such as the square root and exponential functions, GELU exhibits a predominantly linear behavior across both the lower and upper extremes of its domain. Additionally, the GELU activation function maintains a narrow range of values within its non-linear region. These characteristics motivate the adoption of a piece-wise linear approximation as a highly suitable method for replicating the functionality of the GELU function.

Our method employs six breakpoints for GELU computations, resulting in seven linear segments. The initial breakpoints are set at $x=-3$ and $x=3$, chosen to emulate GELU's linear behavior as $x$ approaches $\pm\infty$. Like many established activation functions (e.g., ReLU, PReLU, GELU, SiLU), our approximation passes through the origin, which introduces a third breakpoint at $x=0$. To capture GELU's ability to generate negative outputs, a breakpoint at $x=-0.75$ approximates its minimum value, enhancing the fidelity of our approximation. To better represent GELU's transitional non-linear behavior within the intervals $[-3,-0.75]$ and $[0,3]$, additional breakpoints at $x=-2.1$ and $x=0.5$ are introduced. These points were determined by minimizing the mean square error, ensuring a more accurate approximation in the specified ranges. With these breakpoints, Figure 1(d) visualizes our final approximation, which is described by the equation below:

$$
\text{PEANO-GELU}(x)=\begin{cases}
0 & \text{if } x<-3\\
-0.0414(x+3) & \text{if } -3\leq x<-2.1\\
-0.0982(x+2.1)-0.0373 & \text{if } -2.1\leq x<-0.75\\
0.2266(x+0.75)-0.17 & \text{if } -0.75\leq x<0\\
0.6914x & \text{if } 0\leq x<0.5\\
1.0617(x-0.5)+0.3457 & \text{if } 0.5\leq x<3\\
x & \text{if } x\geq 3
\end{cases}
$$
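The piece-wise definition above can be transcribed directly into Python and compared against the exact GELU, $x\,\Phi(x)$, computed with `math.erf`. The sketch below is only a floating-point illustration: the segment table layout, the uniform evaluation grid on $[-4,4]$ and the function names are assumptions made here, so the measured error need not match the fixed-point figures reported elsewhere in the paper.

```python
import math

# Each entry: (upper bound of segment, slope, anchor x0, offset),
# evaluated as y = slope * (x - x0) + offset for x below the upper bound.
SEGMENTS = [
    (-3.0, 0.0, 0.0, 0.0),
    (-2.1, -0.0414, -3.0, 0.0),
    (-0.75, -0.0982, -2.1, -0.0373),
    (0.0, 0.2266, -0.75, -0.17),
    (0.5, 0.6914, 0.0, 0.0),
    (3.0, 1.0617, 0.5, 0.3457),
]


def peano_gelu(x: float) -> float:
    """Seven-segment piece-wise linear approximation of GELU."""
    for upper, slope, x0, offset in SEGMENTS:
        if x < upper:
            return slope * (x - x0) + offset
    return x  # identity segment for x >= 3


def exact_gelu(x: float) -> float:
    """Reference GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Uniform grid on [-4, 4] (assumed sampling, chosen for illustration only).
xs = [-4.0 + 8.0 * i / 2000 for i in range(2001)]
errors = [peano_gelu(x) - exact_gelu(x) for x in xs]
mse = sum(e * e for e in errors) / len(errors)
max_err = max(abs(e) for e in errors)
print(f"MSE = {mse:.2e}, max |error| = {max_err:.3f}")
```

Evaluating only comparisons, one multiplication and one addition per input is what makes this form attractive for hardware, since no exponential or error-function unit is needed.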

3.4. FPGA Implementation

The overall FPGA implementation of PEANO-ViT's non-linear layers is illustrated in Figure 2. Notably, each non-linear function processes $N$ elements concurrently, enabling an approximately $N$-fold reduction in computation time. To enhance processing speed further, FIFO queues have been integrated between the reading, storing, and computing stages across all three implementations. Unlike GELU, both layer normalization and softmax require two readings of the input data: the first for preliminary calculations and the second for the normalization phase. Integrating an extra FIFO in parallel to the primary data stream notably decreases the latency of both the layer normalization and softmax modules by eliminating the need to temporarily store input values for a second calculation phase. Increasing the parameter $N$ accelerates the processing of the non-linear functions at the cost of higher FPGA resource consumption. Consequently, PEANO-ViT is a configurable hardware framework in addition to offering flexibility in software.
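The two-pass dataflow described above can be mimicked in software to make the role of the side FIFO concrete. The following Python sketch is a purely behavioral analogy and not a model of the actual hardware: it uses an exact reciprocal square root in place of the PEANO approximation, a `deque` as the FIFO, and a hypothetical `n_parallel` argument standing in for $N$.

```python
import math
from collections import deque


def two_pass_layer_norm(values: list[float], n_parallel: int, eps: float = 1e-5) -> list[float]:
    """Pass 1 gathers statistics chunk by chunk; pass 2 normalizes from a side FIFO."""
    side_fifo: deque[list[float]] = deque()
    total = 0.0
    total_sq = 0.0
    # Pass 1: read N elements at a time, accumulate sums, and queue a copy.
    for start in range(0, len(values), n_parallel):
        chunk = values[start:start + n_parallel]
        total += sum(chunk)
        total_sq += sum(v * v for v in chunk)
        side_fifo.append(chunk)
    mean = total / len(values)
    variance = total_sq / len(values) - mean * mean
    # Exact here; the paper replaces this step with a division-free approximation.
    inv_std = 1.0 / math.sqrt(variance + eps)
    # Pass 2: drain the FIFO instead of re-reading the original input.
    out: list[float] = []
    while side_fifo:
        chunk = side_fifo.popleft()
        out.extend((v - mean) * inv_std for v in chunk)
    return out


result = two_pass_layer_norm([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], n_parallel=4)
print(result)
```

In this analogy each pass takes about `len(values) / n_parallel` steps, which is the software counterpart of the approximately $N$-fold speed-up mentioned above.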

Figure 2. Overall FPGA implementation of PEANO-ViT.
Algorithm 2 Multi-Scale Reciprocal approximation (MSR-approx)
Input: $x$, $\alpha^{*}$, $StoredRecip[2^{\alpha^{*}+1}-1]=\{\frac{1}{1},\ldots,\frac{1}{2^{\alpha^{*}+1}-1}\}$
Output: $y$ // approximation of $\frac{1}{x}$
$logInterval = LeadingOne(x)$
if $logInterval \leq \alpha^{*}$ then
    $\alpha = 0$
else
    $\alpha = logInterval - \alpha^{*}$
end if
$Scale = 2^{\alpha}$
$y = (StoredRecip[\lfloor x \gg \alpha \rfloor]) \gg \alpha$
return $y$
Algorithm 3 PEANO Softmax
Input: $x_{1},\ldots,x_{n}$
Output: $y_{1},\ldots,y_{n}$
$MaxInput = \max(x_{i})$ // Maximum of inputs
$\tilde{x}_{i} = x_{i} - MaxInput + 2$ // Shifting inputs by $2 - MaxInput$
for $i=1$ to $n$ do
    if $\tilde{x}_{i} < -3$ then
        $PEANOexp_{i} = 0$
    else
        $PEANOexp_{i} = (12 + 6\tilde{x}_{i} + \tilde{x}_{i}^{2}) \times \text{MSR-approx}(12 - 6\tilde{x}_{i} + \tilde{x}_{i}^{2})$
    end if
end for
$Sum = \sum_{i=1}^{n} PEANOexp_{i}$ // Summation of exponential terms
for $i=1$ to $n$ do
    $y_{i} = PEANOexp_{i} \times \text{MSR-approx}(Sum)$
end for
return $y_{1},\ldots,y_{n}$
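The softmax procedure of Algorithm 3 can be exercised with a small Python model. Several details below are assumptions of this sketch and are not taken from the paper: values are converted to integers with a hypothetical 8-bit fractional fixed-point format before MSR-approx is applied, the table lookup is modelled as an exact $\frac{1}{k}$, and the denominator of the exponential is taken as the standard [2/2] Padé form $12-6\tilde{x}+\tilde{x}^{2}$.

```python
import math

ALPHA_STAR = 4
FRAC_BITS = 8  # hypothetical fixed-point format with 8 fractional bits


def msr_approx(x: int, alpha_star: int = ALPHA_STAR) -> float:
    """Multi-scale reciprocal of a positive integer (table lookup modelled as 1/k)."""
    alpha = max(0, (x.bit_length() - 1) - alpha_star)
    return (1.0 / (x >> alpha)) / (1 << alpha)


def fixed_recip(value: float) -> float:
    """Approximate 1/value by applying MSR-approx to its fixed-point integer code."""
    code = int(round(value * (1 << FRAC_BITS)))
    return msr_approx(code) * (1 << FRAC_BITS)


def peano_exp(x_shifted: float) -> float:
    """Pade-style exponential on the shifted input, clipped to 0 below -3."""
    if x_shifted < -3.0:
        return 0.0
    num = 12.0 + 6.0 * x_shifted + x_shifted * x_shifted
    den = 12.0 - 6.0 * x_shifted + x_shifted * x_shifted  # assumed [2/2] Pade denominator
    return num * fixed_recip(den)


def peano_softmax(xs: list[float]) -> list[float]:
    """Division-free softmax: shift by 2 - max, approximate exp, multiply by 1/Sum."""
    max_input = max(xs)
    exps = [peano_exp(x - max_input + 2.0) for x in xs]
    inv_sum = fixed_recip(sum(exps))
    return [e * inv_sum for e in exps]


inputs = [1.0, 2.0, 3.0, 0.5]
approx = peano_softmax(inputs)
denom = sum(math.exp(x) for x in inputs)
exact = [math.exp(x) / denom for x in inputs]
print(approx)
print(exact)
```

Shifting the inputs so that the maximum maps to 2 keeps every argument of the exponential in $[-3,2]$, where the rational approximation is accurate, and guarantees that both reciprocals are taken of strictly positive values.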
Table 2. Accuracy loss of approximations on the ImageNet-1K benchmark. The results of (Wang et al., 2023) and (Li et al., 2023), where available, are taken directly from the respective papers. FP32 and FiP16 stand for 32-bit floating-point and 16-bit fixed-point, respectively.

| Model | Approach | Approximations | Accuracy |
|---|---|---|---|
| DeiT-S | Baseline (FP32) | - | 79.85% |
| | SOLE (Wang et al., 2023) (FP32) | Layer normalization + softmax | 79.27% |
| | PEANO-ViT (Ours) (FP32) | Layer normalization + softmax | 79.36% |
| | PEANO-ViT (Ours) (FiP16) | All non-linearities | 79.13% |
| DeiT-B | Baseline (FP32) | - | 81.85% |
| | SOLE (Wang et al., 2023) (FP32) | Layer normalization + softmax | 81.60% |
| | PEANO-ViT (Ours) (FP32) | Layer normalization + softmax | 81.55% |
| | PEANO-ViT (Ours) (FiP16) | All non-linearities | 81.35% |
| | PEANO-ViT (Ours) with LMSR-approx (FiP16) | All non-linearities | 81.65% |
| Swin-B | Baseline (FP32) | - | 83.60% |
| | SOLE (Wang et al., 2023) (FP32) | Layer normalization + softmax | 83.05% |
| | PEANO-ViT (Ours) (FP32) | Layer normalization + softmax | 83.60% |
| | PEANO-ViT (Ours) (FiP16) | All non-linearities | 83.56% |
| ViT-L | Baseline (FP32) | - | 85.15% |
| | Li et al. (Li et al., 2023) (FiP16) | Softmax + GELU | 84.78% |
| | PEANO-ViT (Ours) (FiP16) | Softmax + GELU | 85.03% |
| | PEANO-ViT (Ours) (FiP16) | All non-linearities | 84.83% |

4. Results and Discussions

In this study, the PEANO-ViT model was implemented on a Xilinx UltraScale+ VU9P board running at a frequency of 250 MHz. We utilized the Vivado power report from Xilinx to evaluate the power consumption of each design. To evaluate the performance of PEANO-ViT, we employed the publicly available ImageNet-1K dataset (Deng et al., 2009) and three different model architectures, namely ViT (Dosovitskiy et al., 2021), DeiT (Touvron et al., 2021) and Swin (Liu et al., 2021b), across various sizes (small, base, and large). It is important to point out that our experimental setup does not require extensive retraining. Instead, we conducted only two epochs of fine-tuning after integrating each approximation into the model. We utilized pre-trained models from the TIMM library (Wightman, 2019) as our starting point and implemented our approximations using PyTorch.

Table 3. Hardware metrics for the DeiT-B implementation.

| Non-linear layer | Approach | DSP | DSP (Reduction) | LUT | LUT (Reduction) | Register | Register (Reduction) | Power efficiency |
|---|---|---|---|---|---|---|---|---|
| Layer normalization | Standard layer normalization | 51 | - | 24609 | - | 29831 | - | $1\times$ |
| | LTrans-OPU (Bai et al., 2023) | 0 | 100% | 60902 | -147.4% | 7850 | 73.6% | $0.99\times$ |
| | PEANO layer normalization (Ours) | 52 | -1.9% | 8157 | 66.8% | 8621 | 71.1% | $\boldsymbol{1.91\times}$ |
| Softmax | Standard softmax | 64 | - | 9745 | - | 10648 | - | $1\times$ |
| | LTrans-OPU (Bai et al., 2023) | 0 | 100% | 238569 | -2348.1% | 13837 | -29.9% | $0.19\times$ |
| | PEANO softmax with MSR-approx (Ours) | 48 | 25% | 5595 | 42.5% | 3831 | 64% | $\boldsymbol{1.39\times}$ |
| | PEANO softmax with LMSR-approx (Ours) | 49 | 23.4% | 5741 | 41.1% | 3876 | 63.6% | $1.38\times$ |
| GELU | Standard GELU | 128 | - | 101267 | - | 88293 | - | $1\times$ |
| | LTrans-OPU (Bai et al., 2023) | 0 | 100% | 11314 | 88.8% | 2499 | 97.1% | $6.76\times$ |
| | PEANO GELU (Ours) | 16 | 87.5% | 2940 | 97.1% | 2951 | 96.6% | $\boldsymbol{8.01\times}$ |

4.1. ImageNet Classification

Table 2 compares the accuracy loss of four ViT-based models that use the PEANO-ViT approximations against the techniques proposed in (Wang et al., 2023) and (Li et al., 2023), which are implemented on FPGA and GPU platforms, respectively. In our analysis, we set the layer normalization parameter $m=4$ and the MSR-approx parameter $\alpha^{*}=4$, without linear interpolation except in the row marked LMSR-approx. The superior performance of PEANO-ViT compared to (Li et al., 2023) and (Wang et al., 2023) stems from its independent approximations of the softmax, GELU, and layer normalization functions, whereas (Li et al., 2023) focuses solely on softmax and GELU, and (Wang et al., 2023) on layer normalization and softmax. The results in Table 2 indicate that PEANO-ViT exhibits minimal accuracy degradation when approximations are applied to all non-linear blocks. Furthermore, when the same set of non-linearities is approximated, PEANO-ViT achieves a lower accuracy reduction on the DeiT-S, Swin-B, and ViT-L models than the methods of (Li et al., 2023) and SOLE (Wang et al., 2023). For the DeiT-B model, PEANO-ViT shows lower accuracy degradation than SOLE (Wang et al., 2023) once MSR-approx is replaced with LMSR-approx. Notably, PEANO-ViT can further reduce the accuracy loss by adjusting $m$ and $\alpha^{*}$ and by incorporating linear interpolation into the MSR approximation (LMSR-approx).

4.2. Hardware Cost

Table 3 details the power efficiency gain and reduction in resource usage achieved by implementing PEANO-ViT. By utilizing the rapid and hardware-compatible approximations introduced by PEANO-ViT, the significant power consumption and resource usage associated with hardware-intensive and costly iterative methods for exact non-linear implementation have been greatly diminished. Furthermore, Table 3 provides the resource utilization breakdown for each non-linear layer of PEANO-ViT. In processing layers such as normalization, softmax, and GELU, we simultaneously handle 16 elements, resulting in a Level of Parallelism (LoP) of 16 to enable a fair comparison with LTrans-OPU. This LoP can be adjusted to align with resource availability and latency objectives, making PEANO-ViT a versatile framework for enhancing the speed of machine learning tasks. Increasing the LoP enhances processing speed but may lead to higher resource consumption and power usage.
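The reduction columns of Table 3 appear to be relative savings with respect to the standard implementation of each layer. Under that assumption, which is inferred from the tabulated numbers rather than stated in the text, the percentages can be recomputed from the raw counts as in the following Python sketch (the helper name is illustrative).

```python
def reduction_percent(baseline: int, value: int) -> float:
    """Relative saving versus the baseline, in percent (negative means an increase)."""
    return 100.0 * (baseline - value) / baseline


# (standard, PEANO) resource counts for layer normalization, copied from Table 3.
LAYER_NORM = {"DSP": (51, 52), "LUT": (24609, 8157), "Register": (29831, 8621)}

for name, (standard, peano) in LAYER_NORM.items():
    print(f"{name}: {reduction_percent(standard, peano):.2f}% reduction")
```

A negative value, as for the DSP count of layer normalization, indicates that the approximation uses slightly more of that resource than the standard design.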

4.3. Flexibility of PEANO-ViT

PEANO-ViT is a highly versatile framework that can be tailored to meet specific accuracy goals, hardware resource limitations, and power consumption requirements. This adaptability is achieved by adjusting key parameters such as $m$ for layer normalization and $\alpha^{*}$ for softmax, and by selecting between the MSR and LMSR approximations for softmax. Furthermore, the framework offers flexibility in the number of linear segments used to approximate the GELU function. Table 4 illustrates the impact of different configurations on the mean square error (MSE) of the approximated functions. Increasing the values of $m$ and $\alpha^{*}$, increasing the number of linear segments in GELU, and choosing LMSR over MSR all improve accuracy but consume more hardware resources, resulting in increased power consumption.

Table 4. Effect of PEANO-ViT parameters on approximation accuracy.

| Function | Test input interval | Changed parameter | MSE |
|---|---|---|---|
| Reciprocal square root | $[1,128]$ | $m=3$ | $4.93\times 10^{-5}$ |
| | | $m=4$ | $9.56\times 10^{-6}$ |
| | | $m=5$ | $7.86\times 10^{-6}$ |
| Reciprocal | $[8,64]$ | $\alpha^{*}=4$, MSR | $4.19\times 10^{-6}$ |
| | | $\alpha^{*}=5$, MSR | $4.03\times 10^{-6}$ |
| | | $\alpha^{*}=4$, LMSR | $3.63\times 10^{-9}$ |
| | | $\alpha^{*}=5$, LMSR | $3.58\times 10^{-9}$ |
| GELU | $[-4,4]$ | 7 segments | $2.65\times 10^{-4}$ |
| | | 10 segments | $8.31\times 10^{-5}$ |

5. Conclusion

PEANO-ViT optimizes ViT models by approximating their non-linear blocks and eliminating division operations while incurring only a minimal reduction in accuracy. This approach enhances power efficiency and saves hardware resources, setting a new benchmark for sustainable deep learning. Its flexibility allows accuracy, hardware resource usage, and power consumption to be tuned so that specific performance requirements can be met.

Acknowledgment: This research is supported by a grant from the Software and Hardware Foundations program of the NSF.

References

  • Azizi et al. (2024) Seyedarmin Azizi, Mahdi Nazemi, and Massoud Pedram. 2024. Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy. arXiv:2402.06004 [cs.CV]
  • Bai et al. (2023) Yueyin Bai et al. 2023. LTrans-OPU: A Low-Latency FPGA-Based Overlay Processor for Transformer Networks. In 33rd International Conference on Field-Programmable Logic and Applications, FPL 2023. IEEE, 283–287.
  • Deng et al. (2009) Jia Deng et al. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy et al. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations.
  • Koca et al. (2023) Nazim Altar Koca et al. 2023. Hardware-efficient Softmax Approximation for Self-Attention Networks. In IEEE International Symposium on Circuits and Systems.
  • Li et al. (2023) Tianyang Li et al. 2023. A high speed reconfigurable architecture for softmax and GELU in vision transformer. Electronics Letters 59, 5 (2023), e12751.
  • Liu et al. (2021a) Zhenhua Liu et al. 2021a. Post-Training Quantization for Vision Transformer. In Annual Conference on Neural Information Processing Systems 2021.
  • Liu et al. (2021b) Ze Liu et al. 2021b. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In 2021 IEEE/CVF International Conference on Computer Vision.
  • Peltekis et al. (2024) Christodoulos Peltekis et al. 2024. Reusing Softmax Hardware Unit for GELU Computation in Transformers. (2024). arXiv:2402.10118
  • Stevens et al. (2021) Jacob R. Stevens et al. 2021. Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers. In 58th ACM/IEEE Design Automation Conf.
  • Touvron et al. (2021) Hugo Touvron et al. 2021. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th Int. Conf. on Machine Learning.
  • Vaswani et al. (2017) Ashish Vaswani et al. 2017. Attention is All you Need. In Annual Conference on Neural Information Processing Systems 2017.
  • Wang et al. (2023) Wenxun Wang et al. 2023. SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference. In IEEE/ACM International Conference on Computer Aided Design.
  • Wightman (2019) Ross Wightman. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models. https://doi.org/10.5281/zenodo.4414861
  • Yu et al. (2022) Fang Yu et al. 2022. Width & Depth Pruning for Vision Transformers. In Thirty-Sixth AAAI Conference on Artificial Intelligence.