PIVOT- Input-aware Path Selection for Energy-efficient ViT Inference

Abhishek Moitra, Abhiroop Bhattacharjee and Priyadarshini Panda
Yale University, New Haven, CT, 06511, USA
(2024)
Abstract.

The attention module in vision transformers (ViTs) performs intricate spatial correlations, contributing significantly to both accuracy and delay. It is therefore important to modulate the number of attentions according to the input feature complexity for optimal delay-accuracy tradeoffs. To this end, we propose PIVOT, a co-optimization framework that selectively performs attention skipping based on the input difficulty. For this, PIVOT employs a hardware-in-loop co-search to obtain optimal attention skip configurations. Evaluations on the ZCU102 MPSoC FPGA show that PIVOT achieves 2.7$\times$ lower EDP at 0.2% accuracy reduction compared to the LVViT-S ViT. PIVOT also achieves 1.3% higher accuracy and 1.8$\times$ higher throughput than prior works on traditional CPUs and GPUs. The PIVOT project can be found at this Github link.

Vision Transformers, Systolic Array Accelerators, Energy-efficiency
copyright: acmlicensed; journalyear: 2024; conference: 61st ACM/IEEE Design Automation Conference; June 23–27, 2024; San Francisco, CA, USA; booktitle: 61st ACM/IEEE Design Automation Conference (DAC ’24), June 23–27, 2024, San Francisco, CA, USA; doi: 10.1145/3649329.3655679; isbn: 979-8-4007-0601-1/24/06

1. Introduction

Vision Transformers (ViTs) have demonstrated remarkable accuracy in large-scale image classification tasks (Dehghani et al., 2023; Han et al., 2022; Dosovitskiy et al., 2020). The success of ViTs can be attributed to the attention module shown in Fig. 1a, which utilizes the self-attention mechanism to perform sophisticated spatial correlation operations (Han et al., 2022). However, the attention module involves computationally intensive operations, including matrix multiplications and non-linear functions like softmax (Han et al., 2022; Dehghani et al., 2023). Hence, as seen in Fig. 1b, the attention module (QKV+QKT+SM+(SMxV)+Proj combined) contributes 77.5% to 81.9% of the total ViT inference delay.

Recently, several ViT inference optimization works have focused on reducing the attention delay overhead. These mainly fall under two categories: 1) attention sparsification (Kim et al., 2021; You et al., 2023) and 2) token pruning (Dong et al., 2023; Rao et al., 2021; Wang et al., 2021). Attention sparsification techniques exploit the sparsity in the QKT and (SMxV) layers (Kim et al., 2021; You et al., 2023) (Fig. 1a). In (Kim et al., 2021), the authors algorithmically investigate the effect of structured sparsity in the attention heads on ViT accuracy. In a more recent work (You et al., 2023), the authors propose an accelerator co-design framework that performs sparse-dense attention decomposition and develop a sparse accelerator to exploit the attention sparsity. The objective of token pruning is to selectively reduce the number of tokens in the ViT. In (Rao et al., 2021; Wang et al., 2021), the authors use predictor networks to compute the global-local token importance and eliminate redundant tokens. In HeatViT (Dong et al., 2023), the authors use predictor networks to score the token importance based on the information in each attention head. Along with the predictor networks, the authors use a token packaging technique wherein unimportant tokens are combined into one token to maintain a good accuracy-efficiency tradeoff. Although attention sparsification and token pruning works (Kim et al., 2021; You et al., 2023; Rao et al., 2021; Dong et al., 2023) achieve good accuracy at reduced computation, they have two major problems. First, the portion of delay optimized by these works is small. For example, attention sparsification works can optimize only 7.3-7.7% of the overall delay since they target the QKT and (SMxV) layers, as shown in Fig. 1b. Second, attention sparsification and token pruning approaches require specialized hardware support to achieve optimal efficiency. For example, attention sparsification works require sparse matrix multiplication hardware to fully exploit sparse computations. Similarly, token pruning works require custom hardware design to efficiently implement the token score predictor modules. Thus, as shown in Fig. 1c, when implemented on general purpose platforms (GPPs) such as CPUs and GPUs, they do not achieve any inference delay benefits and, in fact, result in lower throughput compared to a dense baseline.

Figure 1. (a) Encoder architecture of a vision transformer. Q-Query, K-Key and V-Value. (b) Delay distribution across different ViT modules for DeiT-S (left) and LVViT-S (right) ViTs. Note that the attention delay is QKV+SM+QKT+(SMxV)+Proj. (c) Throughput of PIVOT compared with the DeiT-S Baseline (a standard DeiT-S (Touvron et al., 2021) ViT), prior token pruning (HeatViT (Dong et al., 2023)) and attention sparsification (ViTCOD (You et al., 2023)) techniques implemented on GPUs (Nvidia V100, RTX2080ti, Jetson Orin Nano) and CPUs (Intel Xeon and Raspberry Pi 4). (d) PIVOT’s input difficulty-aware inference.

Another consideration missing from prior ViT optimization literature is input difficulty awareness. Different images have different feature complexity. For example, an easy image will contain simple, low-level features compared to a difficult image with intricate feature representations (Wu et al., 2018). Since attention modules are responsible for capturing different levels of feature representations in the image, it is imperative to modulate the number of attentions in a ViT according to the input difficulty (Fig. 1d). Doing so ensures that only the minimum number of attentions is activated to achieve high accuracy at low inference delay. There have been several input difficulty-aware network optimization works in the CNN literature (Panda et al., 2016; Wu et al., 2018; Bhattacharjee et al., 2022). However, there are no works that analyze the co-dependency between the number of attentions and input difficulty from the perspective of accuracy and ViT inference delay.

To this end, we propose PIVOT, a hardware-algorithm co-design framework that modulates the number of attentions in the ViT according to the input difficulty. The goal of PIVOT is to achieve high classification accuracy by using the minimum number of attentions in the ViT. As shown in Fig. 1d, during inference, PIVOT uses two kinds of ViTs: 1) Low Effort and 2) High Effort ViT. The low effort ViT entails more attention skips than the high effort ViT and classifies the easy images, while the high effort ViT is used for classifying the difficult images. An iterative hardware-in-the-loop co-search is applied to obtain the optimal low and high effort ViTs according to the user-provided delay constraints. For evaluation, we implement PIVOT on various GPPs such as CPUs and GPUs. Additionally, we evaluate PIVOT on a Xilinx ZCU102-implemented systolic array accelerator (Samajdar et al., 2018). Unlike token pruning and attention sparsification works, PIVOT does not require any application-specific hardware and can achieve 1.3$\times$-2$\times$ higher throughput than the baseline across various GPPs, as shown in Fig. 1c.

In summary, the key contributions of our work are:

  (1) We propose PIVOT, a hardware-algorithm co-optimization framework that leverages input difficulty-aware attention skipping in ViTs to overcome the high inference delay overhead of the attention module. During attention optimization, PIVOT uses PIVOT-Sim, a cycle-accurate simulator for ViT implemented on a Xilinx ZCU102 FPGA-based systolic array accelerator. PIVOT-Sim will be made open-source and can benchmark different state-of-the-art ViTs.

  (2) Using PIVOT-Sim, we find that PIVOT achieves 1.73$\times$ (2.7$\times$) lower energy-delay-product (EDP) at merely 0.4% (0.2%) accuracy reduction compared to DeiT-S (Touvron et al., 2021) (LVViT-S (Jiang et al., 2021)) baselines. End-to-end evaluations using PIVOT-Sim show that PIVOT is able to achieve more than 1.7$\times$ energy reduction across different resources in the Xilinx ZCU102 FPGA such as the ZynQ MPSoC PS, systolic array, on-chip buffers, and communication/memory controller circuits.

  (3) Through extensive experiments, we show the overheads introduced by prior ViT co-optimization works (Dong et al., 2023; You et al., 2023) when implemented on GPPs such as GPUs and CPUs. As PIVOT does not require specialized hardware support, when implemented on GPPs, it achieves 1.8$\times$ higher throughput at 0.4-1.3% higher accuracy compared to prior works.

2. Background on Vision Transformer

A Vision Transformer (ViT) comprises multiple cascaded encoders, and each encoder follows the architecture depicted in Fig. 1a. In each encoder, the inputs of dimensions $t \times d$ undergo QKV operations wherein weights $W_Q$, $W_K$ and $W_V$ are multiplied with the input to generate the Query (Q), Key (K) and Value (V) matrices. The attention module uses the multi-head self-attention (MHSA) mechanism, which captures close relationships between different image features (Touvron et al., 2021; Jiang et al., 2021; Rao et al., 2021). For this, the Q, K and V outputs are partitioned into multiple smaller attention heads ($Q_i$, $K_i$, $V_i$), where $i$ denotes a head of MHSA.

The attention is computed using Equation 1. In each head, a matrix multiplication between $Q_i$ and $K_i^T$ ($QK^T$) is performed, followed by the softmax (SM) and matrix multiplication with $V_i$ (SM$\times$V) operations (Stevens et al., 2021). The softmax is computed using Equation 2.

$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d}}\right) V_i, \tag{1}$$

$$\mathrm{Softmax}(x_i) = \frac{e^{x_i - x_{max}}}{\sum_{i} e^{x_i - x_{max}}}. \tag{2}$$

Next, the attention outputs are concatenated, resulting in a $t \times d$ output attention matrix. Following this, the projection and MLP layers project the information into a higher dimension feature space. Each encoder outputs a $t \times d$ matrix that is forwarded to the subsequent encoder.
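The data flow of Equations 1 and 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, biases and the projection layer are omitted, and the scores are scaled by the per-head dimension, which is the common convention and an assumption about what $d$ denotes in Equation 1.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # Equation 2: subtracting the maximum leaves the result unchanged
    # but keeps exp() from overflowing.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, w_q, w_k, w_v, num_heads):
    """x: (t, d) tokens; w_q, w_k, w_v: (d, d) weights. Returns (t, d)."""
    t, d = x.shape
    d_head = d // num_heads  # assumes d is divisible by num_heads

    def split_heads(m):
        # (t, d) -> (num_heads, t, d_head)
        return m.reshape(t, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))   # QKV
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)        # QK^T
    heads = stable_softmax(scores) @ v                         # SM, then SM x V
    # Concatenate the heads back into a (t, d) attention matrix.
    return heads.transpose(1, 0, 2).reshape(t, d)
```

The per-head score matrix has shape $t \times t$, which is why the QKT, SM and (SMxV) steps scale quadratically with the number of tokens.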

3. PIVOT Methodology

Figure 2. (a) Input difficulty-aware inference procedure with PIVOT, (b) PIVOT’s Phase1 and (c) Phase2 methodology. $LEC$ denotes the user-provided low effort constraint, i.e., the fraction of inputs that must be classified by the low effort ViT. For PIVOT-Sim, ViT params include embedding dim size, mlp ratio, etc., and systolic array params include array size, dataflow, etc.

3.1. PIVOT Inference with Low and High Efforts

During PIVOT’s inference, we use the entropy metric to determine the number of attentions required to classify an input (Li et al., 2023). The entropy, $E(x)$, for an input $x$ (belonging to a dataset with $K$ classes) is calculated using Equation 3. Here, $\pi(\bm{y}|\bm{x})$ is the (softmax-normalized) logit output of the ViT. The term $1/\log K$ normalizes the final entropy to $(0,1]$.

$$E(\bm{x}) = -\frac{1}{\log K}\sum_{i=1}^{K} \pi(\bm{y}_i|\bm{x}) \log \pi(\bm{y}_i|\bm{x}). \tag{3}$$

The entropy measures the confidence of prediction. For example, if all classes have an equal probability of $\frac{1}{K}$, the entropy value will be 1, implying uncertainty in the prediction. In contrast, if one class’s prediction probability reaches 1 while the other classes attain 0 probability, the entropy reaches 0, implying a confident prediction.
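Equation 3 and the two limiting cases above can be checked with a short sketch. The function name is hypothetical, and the sketch assumes that $\pi(\bm{y}|\bm{x})$ is obtained by applying softmax to the raw logits.

```python
import numpy as np

def normalized_entropy(logits):
    """Entropy of the softmax distribution, scaled by 1/log K (Equation 3)."""
    logits = np.asarray(logits, dtype=np.float64)
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    k = logits.shape[-1]
    # Treat 0 * log(0) as 0 so that fully confident predictions are well defined.
    safe_p = np.where(p > 0, p, 1.0)
    return -(p * np.log(safe_p)).sum(axis=-1) / np.log(k)

uniform = normalized_entropy([0.0, 0.0, 0.0, 0.0])      # equal probabilities
confident = normalized_entropy([100.0, 0.0, 0.0, 0.0])  # one dominant class
```

Here `uniform` evaluates to 1 and `confident` is numerically indistinguishable from 0, matching the two cases described in the text.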

As shown in Fig. 2a, during inference, PIVOT uses a combination of two efforts: 1) Low Effort and 2) High Effort. Here, Effort is defined as the number of active attention modules (attentions that are not skipped) in the ViT. First, all inputs are inferred with the low effort, resulting in the logit outputs ($\pi(\bm{y}|\bm{x})$) and the entropy values ($E(x)$). For inputs with entropy values lower than the threshold ($Th$), the $\pi(\bm{y}|\bm{x})$ from the low effort ViT are used for class prediction. For inputs with $E(x) > Th$, an additional inference is performed with the high effort, after which every input has been classified. In Fig. 2a, $F_L$ and $F_H$ are defined as the fractions of inputs classified by the low ($E(x) < Th$) and high ($E(x) > Th$) efforts, respectively. Additionally, the numbers of inputs correctly (incorrectly) classified with the low and high efforts are denoted as $C_L$ ($I_L$) and $C_H$ ($I_H$), respectively. The $C_L$ and $C_H$ values are used to compute the accuracy.

Re-computation Overhead: During inference, the inputs that the low effort cannot classify confidently are re-inferred with the high effort. This entails a re-computation overhead that must be managed to obtain a good tradeoff between accuracy and efficiency.
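The two-effort procedure can be sketched as follows. This is an illustration under assumptions, not PIVOT's code: `low_vit` and `high_vit` are hypothetical callables that map a batch of inputs to logits, and inputs with entropy exactly equal to the threshold are kept on the low-effort side, a case the text leaves unspecified.

```python
import numpy as np

def _entropy(logits):
    # Normalized entropy of the softmax distribution (Equation 3).
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # The small epsilon avoids log(0) for saturated predictions.
    return -(p * np.log(p + 1e-12)).sum(axis=-1) / np.log(logits.shape[-1])

def two_effort_inference(inputs, low_vit, high_vit, th):
    """Returns (predictions, F_L, F_H) for a batch of inputs."""
    low_logits = low_vit(inputs)               # every input pays the low-effort pass
    hard = _entropy(low_logits) > th           # uncertain inputs need the high effort
    preds = low_logits.argmax(axis=-1)
    if hard.any():
        # Re-computation: hard inputs are inferred a second time.
        preds[hard] = high_vit(inputs[hard]).argmax(axis=-1)
    f_high = float(hard.mean())
    return preds, 1.0 - f_high, f_high
```

Raising `th` sends fewer inputs to the high-effort ViT, which lowers the re-computation overhead at the risk of accepting less confident low-effort predictions.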

3.2. PIVOT Phase1: Optimal Path Selection

PIVOT uses a two-phase hardware-in-the-loop search to design the multi-effort ViT. In Phase1, we select the optimal path for different efforts for a given ViT. Each effort contains multiple Paths. For example, as shown in Fig. 2b, a ViT with 5 encoders and Effort=3 entails $\binom{5}{3} = 10$ possible paths. Here, a Path is uniquely defined by the positions of the encoders with active and inactive attention modules. Having a large number of paths for each effort increases the search space size in Phase2. Therefore, we define a Path-Score metric (shown in Algorithm 1) to single out the Optimal Path (shown in yellow) corresponding to each effort. The path with the highest Path-Score ($\mathcal{S}$) is chosen as the Optimal Path and trained with the loss function shown in Fig. 2b. The loss function contains the cross-entropy loss $L_{CE}$ and the distillation loss $L_{Distill}$ between the final layer features of the teacher and student ViT. $L_{CE}$ and $L_{Distill}$ are commonly used in prior works to train high performance ViTs (Touvron et al., 2021; Jiang et al., 2021). In PIVOT, to improve the prediction confidence, we add the regularization term $L_{En}$ that lowers the entropy for the correctly classified inputs. $L_{En}$ is the mean of the entropy values for the correctly classified inputs. Lowering the entropy increases the number of confident classifications with low efforts and thereby improves the inference efficiency.

Figure 3. (a) $CKA~Matrix$ computed between the MLP output of $Encoder_i$ ($MLP_i$) and the attention output of $Encoder_{i+1}$ ($A_{i+1}$) for the DeiT-S ViT. (b) A higher $CKA(MLP_i, A_{i+1})$ suggests data redundancy, so the attention can be skipped.

CKA Matrix: Fig. 3a shows the centered kernel alignment matrix ($CKA~Matrix$) comprising the CKA values computed between the MLP outputs ($MLP_i$) and attention outputs ($A_{i+1}$) of ViT encoders $Encoder_i$ and $Encoder_{i+1}$, respectively. CKA measures the similarity between two matrices (Cortes et al., 2012). A high $CKA(MLP_i, A_{i+1})$ value implies high similarity between the $MLP_i$ and $A_{i+1}$ outputs, thus suggesting that the output $MLP_i$ can be directly forwarded to $MLP_{i+1}$ by skipping $A_{i+1}$, as shown in Fig. 3b (top). In contrast, for a low $CKA(MLP_i, A_{i+1})$ value, the attention cannot be skipped, as shown in Fig. 3b (bottom).
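As an illustration of how one entry of such a matrix could be computed, the sketch below uses the linear-kernel form of CKA on feature matrices. The text does not state which kernel is used or how token features are flattened, so the linear kernel, the function name and the layout of one row per sample are all assumptions.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between x: (n, p) and y: (n, q), one row per sample."""
    # Center every feature column so the alignment ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)
```

The value lies in $[0, 1]$ and equals 1 when the two feature sets differ only by an isotropic scaling, which fits the reading that a high value marks an attention whose output adds little beyond the preceding MLP output.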

Input: Effort Configuration ($Config$), #Encoders in ViT ($D$), $CKA~Matrix$.
Output: Path-Score ($\mathcal{S}$)
$\mathcal{S} = 0$;
for $i \in Config$ do
    for $j \in (i+1, D)$ do
        if ($A_j$ is $Inactive$) then
            $\mathcal{S} = \mathcal{S} + CKA~Matrix(i, j)$;
        else
            break;
Algorithm 1 Path-Score Computation Algorithm
Figure 4. (a) Path Accuracy vs. Path-Score ($\mathcal{S}$) corresponding to Effort = 6 for the DeiT-S ViT. (b) Design space size if random search is performed in Phase2 without selecting the optimal path for each effort in Phase1 (size normalized to PIVOT’s design space size). (c) GPU hours for training DeiT-S, LVViT-S and PIVOT Efforts (normalized to the GPU hours required for training DeiT-S from scratch).

Path-Score ($\mathcal{S}$): Algorithm 1 shows the methodology to compute $\mathcal{S}$. Algorithm 1 requires the $CKA~Matrix$ (shown in Fig. 3a) and the effort configuration ($Config$), containing the encoder locations with active and inactive attention. The $CKA~Matrix$ is generated for a small batch of 256 images. For a given $Config$, $\mathcal{S}$ is computed by summing up the CKA values between the MLP outputs ($MLP$) of the encoders with active attention and the attention outputs ($A$) of the consecutive encoders with inactive attention that follow them. For example, $\mathcal{S}$ for $Config =$ [1,2,3,4,5,6,7,8,9,10,11,12], where the encoder indices of inactive attentions (3, 4, 9 and 10, denoted by cyan) follow the active encoders 2 and 8, can be computed as $CKA[MLP_2, A_3] + CKA[MLP_2, A_4] + CKA[MLP_8, A_9] + CKA[MLP_8, A_{10}]$. A high $\mathcal{S}$ signifies that the path contains highly redundant attentions that can be easily pruned out. Fig. 4a shows the positive correlation between $\mathcal{S}$ and path accuracy. As high-$\mathcal{S}$ paths ensure pruning the most redundant attention blocks, they attain higher accuracy.
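Algorithm 1 and the worked example can be expressed as a short Python sketch. The names `path_score` and `optimal_path` are hypothetical, the outer loop is read as running over the encoders with active attention (the reading that reproduces the four-term sum above), and the matrix is padded so that `cka[i, j]` holds $CKA(MLP_i, A_j)$ with 1-based indices.

```python
from itertools import combinations

import numpy as np

def path_score(active, num_encoders, cka):
    """active: set of 1-based encoder indices whose attention is kept."""
    score = 0.0
    for i in sorted(active):
        # Walk forward over the run of skipped attentions that follows encoder i.
        for j in range(i + 1, num_encoders + 1):
            if j in active:
                break
            score += cka[i, j]
    return score

def optimal_path(effort, num_encoders, cka):
    """Exhaustively score every path with `effort` active attentions."""
    paths = combinations(range(1, num_encoders + 1), effort)
    return max(paths, key=lambda p: path_score(set(p), num_encoders, cka))

# Worked example from the text: attentions 3, 4, 9 and 10 are inactive.
rng = np.random.default_rng(0)
cka = rng.random((13, 13))  # placeholder values, not measured CKA
active = set(range(1, 13)) - {3, 4, 9, 10}
score = path_score(active, 12, cka)
```

Scoring a path only needs table lookups, so ranking all paths of one effort is cheap compared with training or evaluating each of them.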

3.3. PIVOT Phase2: Selecting Optimal Effort Combinations

In Phase2, given a set of efforts with optimal paths (shown in blue in Fig. 2c), PIVOT determines the right effort combination to achieve optimal accuracy while meeting the user-provided delay requirement. 1) First, we start with a pair of low and high efforts (say, Effort 9 and Effort 12). 2) Next, a threshold value $Th$ for the low effort inference is chosen. The $Th$ values are iterated in an incremental manner. 3) A small batch of data (a randomly sampled batch of 256 images from the training set) is inferred with the low and high efforts. This generates the $C_L$, $C_H$, $F_L$ and $F_H$ values. 4) Following this, the accuracy calculator uses $C_L$ and $C_H$ to compute the accuracy (Fig. 2a). The thresholds are iterated until the condition $F_L \geq LEC$ is met. A higher $LEC$ value ensures that more inputs are classified by the low effort ViT. 5) The low ($Config_L$) and high ($Config_H$) effort configurations and the $F_L$ and $F_H$ values are passed to the PIVOT-Sim framework for delay computation. The PIVOT-Sim platform first computes the delays of the low and high efforts ($D_L$ and $D_H$, respectively) using $Config_L$, $Config_H$, and the ViT and systolic array parameters (refer to Section 3.4). Then, it computes the delay of the effort combination using $D_L$, $D_H$, $F_L$ and $F_H$. If the delay lies within 5% of the user-provided delay constraint, the optimal effort combination is obtained. If the delay constraint is not met, a new effort combination (say, Effort 6 and Effort 9) is selected. In order to achieve high accuracy, the sampling starts with the efforts containing the maximum number of active attentions. In each iteration, a smaller effort combination is sampled than in the previous iteration until the desired delay is obtained.
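The control flow of the Phase2 search can be sketched as below. This is a hedged outline rather than the actual search code: `evaluate` stands in for steps 3 and 4 (it returns the accuracy and $F_L$ measured on the small batch), `simulate_delay` stands in for PIVOT-Sim, and both callables, the argument names and the returned dictionary are hypothetical.

```python
def phase2_search(effort_pairs, thresholds, lec, target_delay,
                  evaluate, simulate_delay, tol=0.05):
    """effort_pairs: (low, high) pairs ordered from most to fewest active attentions.
    thresholds: candidate Th values in increasing order."""
    for low, high in effort_pairs:
        for th in thresholds:
            acc, f_low = evaluate(low, high, th)
            if f_low >= lec:          # enough inputs handled by the low effort
                break
        else:
            continue                  # no threshold satisfied the LEC for this pair
        delay = simulate_delay(low, high, f_low)
        # Accept the first combination whose delay is within 5% of the constraint.
        if abs(delay - target_delay) <= tol * target_delay:
            return {"low": low, "high": high, "th": th,
                    "accuracy": acc, "delay": delay}
    return None                       # no sampled combination met the constraint
```

Because the pairs are visited from the largest efforts downward, the first accepted combination is also the one with the most active attentions among those that satisfy the delay constraint.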

Benefit of CKA Score-based Optimal Path Selection: In Fig. 4b, we compare the Phase2 design space size of random and PIVOT-based search. Since PIVOT uses the Path-Score to single out the optimal path for each effort, there exists only one path for each effort combination. In contrast, Phase2 with random search entails multiple paths due to the absence of optimal path selection. For example, in random search, as shown in Fig. 4b, the effort combination [3,6] can contain $\binom{12}{3} \times \binom{12}{6} = 2.03 \times 10^{5}$ possible paths for the DeiT-S ViT. For the DeiT-S ViT, Phase2 with random search has a search space that is $\sim 10^{5}\times$ larger than PIVOT’s search space.
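The path counts quoted in Sections 3.2 and 3.3 follow directly from binomial coefficients, as this short check illustrates (the variable names are illustrative only):

```python
from itertools import combinations
from math import comb

# Section 3.2: a 5-encoder ViT with Effort = 3 has C(5, 3) distinct paths.
paths_5_3 = list(combinations(range(1, 6), 3))

# Section 3.3: a 12-encoder DeiT-S with the effort combination [3, 6].
low_paths = comb(12, 3)                   # 220 placements of 3 active attentions
high_paths = comb(12, 6)                  # 924 placements of 6 active attentions
random_space = low_paths * high_paths     # 203280, i.e. about 2.03e5
pivot_space = 1                           # one optimal path per effort after Phase1
```

The ratio `random_space / pivot_space` gives the roughly $10^5\times$ reduction stated for this effort combination.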

GPU Hours for Training all Efforts: Fig. 4c shows that the combined GPU hours required for training all efforts (see Section 4.1) for DeiT-S (LVViT-S) ViTs in PIVOT is 3$\times$ (2$\times$) less compared to training the DeiT-S (LVViT-S) ViT from scratch. This is because the training time reduces as the ViT effort reduces.

3.4. PIVOT-Sim Platform

Figure 5. The PIVOT-Sim platform.

Fig. 5 shows the overall architecture of the PIVOT-Sim platform. PIVOT-Sim performs cycle-accurate delay estimation for a given ViT effort mapped on a Xilinx ZCU102 MPSoC FPGA-based systolic array accelerator. Like the ZynQ MPSoC FPGA, PIVOT-Sim contains two systems: 1) the ZynQ MPSoC Processing System (PS) and 2) the ZynQ MPSoC programmable logic (PL). All the linear matrix multiplication layers (QKV, QKT, SMxV, Proj and MLP) are executed in the PL-implemented systolic accelerator. Inputs and weights are first loaded from the PS DRAM to the global SRAM buffer (GB) in the PL. Then the weights and inputs are fetched from the GB to the Weight SRAM (WTMEM) and Input SRAM (IPMEM), respectively. Next, the weights from the WTMEM are loaded into the PE array in a streaming fashion, following which the inputs are fetched from the IPMEM in a streaming fashion, column by column. The multiply-and-accumulate (MAC) outputs are stored in the output SRAM (OPMEM). The outputs are pushed to the GB and finally to the DRAM. The non-linear operations such as softmax, entropy and GeLU are implemented using the ZynQ MPSoC PS.

The PIVOT-Sim framework requires the ViT parameters (embedding dimension size, number of tokens, mlp ratio and attention head count), the systolic array parameters (array dimensions, dataflow, SRAM memory sizes, and the clock frequency) and the low (high) effort configurations $Config_L$ ($Config_H$) (discussed in Section 3.2) to compute the low (high) effort delays $D_L$ ($D_H$). Additionally, it computes the delay of the low-high effort combination using the $F_L$, $F_H$, $D_L$ and $D_H$ values as shown in Fig. 5. The $D_L \times F_H$ term in the delay computation accounts for the re-computation overhead. The energy is obtained by multiplying the power with the delay of the effort combination.
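The exact delay expression appears only in Fig. 5, so the sketch below is one form consistent with the description: every input pays $D_L$, and the fraction $F_H$ additionally pays $D_H$, which produces the $D_L \times F_H$ re-computation term. The function names and the unit conversions are assumptions chosen to match the column units of Table 2.

```python
def combined_delay(d_low_ms, d_high_ms, f_low, f_high):
    """Average per-input delay of a low-high effort combination (assumed form)."""
    low_only = f_low * d_low_ms                  # inputs classified by the low effort
    redo = f_high * d_low_ms                     # re-computation overhead (D_L x F_H)
    high = f_high * d_high_ms                    # high-effort pass for hard inputs
    return low_only + redo + high

def energy_and_edp(power_w, delay_ms):
    """Energy in joules and energy-delay product in J x ms."""
    energy_j = power_w * delay_ms / 1000.0       # W x ms = mJ, converted to J
    return energy_j, energy_j * delay_ms

# Illustrative numbers only, not measurements from the paper.
delay = combined_delay(d_low_ms=10.0, d_high_ms=20.0, f_low=0.7, f_high=0.3)
energy, edp = energy_and_edp(power_w=2.0, delay_ms=delay)
```

With $F_L + F_H = 1$ the expression reduces to $D_L + F_H \times D_H$, which makes explicit that lowering $F_H$ is the main lever for reducing both delay and EDP.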

Entropy Computation Overhead: We find that entropy computation (Equation 3) on the Zynq MPSoC PS takes 0.03 ms per image, which is $<0.05\%$ of the inference delay and can thus be ignored.
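For reference, a per-image entropy check of this kind takes only a few lines. The sketch below assumes that Equation 3 is the Shannon entropy of the softmax class probabilities. The function names and the threshold value are hypothetical and are not taken from the PIVOT code base.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def needs_high_effort(logits, threshold):
    """Flag an input as difficult when its prediction entropy exceeds the threshold."""
    return entropy(softmax(logits)) > threshold


confident = [10.0, 0.0, 0.0]   # peaked prediction, low entropy
uncertain = [0.0, 0.0, 0.0]    # uniform prediction, entropy = ln(3)
print(needs_high_effort(confident, threshold=0.5))
print(needs_high_effort(uncertain, threshold=0.5))
```

The cost is a single pass over the class logits, which is consistent with the small overhead reported above.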

4. Experiments and Results

4.1. Experimental Setup

Datasets and ViTs: We benchmark all our results on the standard ImageNet-1K dataset using state-of-the-art efficient ViTs such as DeiT (Touvron et al., 2021) and LV-ViT (Jiang et al., 2021). Baseline: For all experiments, the baseline is a ViT model without any effort modulation, i.e., all ViT attention modules are activated irrespective of the input difficulty. PIVOT-Optimized ViTs: For ease of expression, we refer to the PIVOT-optimized DeiT-S and LVViT-S ViTs as PVDS and PVLS, respectively, throughout the text.

Training Details: In PIVOT, for the DeiT-S and LVViT-S ViTs, we create 7 (3, 4, 5, 6, 7, 8 and 9) and 9 (4, 5, 6, 7, 8, 9, 10, 11 and 12) efforts, respectively. Each effort is finetuned for 30 epochs with the full training data. The ViTs are trained with 8-bit quantization. Training all the efforts is 3× (2×) more efficient than training a DeiT-S (LVViT-S) ViT from scratch (see Fig. 4c). For training, we use PyTorch 1.3.1 with a single Nvidia V100 GPU backend.

Hardware Evaluation: All baselines and PIVOT-optimized ViTs (PVDS and PVLS) are evaluated using the PIVOT-Sim framework. The FPGA implementation parameters for PIVOT-Sim are shown in Table 1. The FPGA implementation requires 4566 LUTs, 20668 Registers, 48 Block RAMs and 2304 digital signal processing cores.

| Parameter | Value |
|---|---|
| FPGA Board | Xilinx ZCU102 |
| Global SRAM (GB) Size | 16KB |
| IPMEM, WTMEM, OPMEM | 64Kb, 64Kb, 64Kb |
| PE Array Size | 64×36 |
| Clock Frequency | 125MHz |
| Dataflow | Input Stationary |

Table 1. FPGA implementation parameters.
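To give a feel for how the array size enters a delay estimate, the sketch below counts how many array-sized tiles ("folds") an operand matrix needs when it is mapped onto the 64×36 PE array. It is a coarse first-order illustration and not PIVOT-Sim's cycle-accurate model. The tiling rule and the function name are assumptions. The DeiT-S dimensions in the example (embedding dimension 384 and a fused QKV projection of width 3×384) are the standard values for that model rather than figures taken from this section.

```python
import math

PE_ROWS, PE_COLS = 64, 36  # PE array size from Table 1


def num_folds(k, n, rows=PE_ROWS, cols=PE_COLS):
    """Tiles needed to cover a k x n operand matrix on a rows x cols array."""
    return math.ceil(k / rows) * math.ceil(n / cols)


# Assumed example: the DeiT-S QKV projection weight, shape 384 x (3 * 384),
# treated as the operand that is loaded into the array tile by tile.
qkv_folds = num_folds(384, 3 * 384)
print("PEs in the array:", PE_ROWS * PE_COLS)
print("folds for the QKV projection:", qkv_folds)
```

The array holds 64 × 36 = 2304 PEs, the same number as the digital signal processing cores reported for the FPGA implementation.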

4.2. Results on DeiT-S and LVViT-S ViTs

Table 2. Performance comparison of DeiT-S and PIVOT-optimized DeiT-S ViTs (PVDS-$N$) sampled at delay=$N$.

| Model | Energy (J) | Delay (ms) | Power (W) | EDP (J×ms) | FPS/W | Accuracy (%) |
|---|---|---|---|---|---|---|
| DeiT-S | 0.47 | 59.66 | 7.92 | 28.19 | 2.14 (1×) | 79.8 |
| PVDS-50 | 0.38 (1.23×) | 48.47 (1.23×) | 7.92 | 16.21 (1.73×) | 2.7 (1.23×) | 79.4 |
| PVDS-35 | 0.292 (1.62×) | 36.9 (1.61×) | 7.92 | 10.5 (2.6×) | 3.4 (1.61×) | 78.2 |

Table 3. Performance comparison of LVViT-S and PIVOT-optimized LVViT-S ViTs (PVLS-$N$) sampled at delay=$N$.

| Model | Energy (J) | Delay (ms) | Power (W) | EDP (J×ms) | FPS/W | Accuracy (%) |
|---|---|---|---|---|---|---|
| LVViT-S | 0.63 | 79.55 | 7.92 | 50.8 | 1.57 (1×) | 82.8 |
| PVLS-50 | 0.410 (1.57×) | 50 (1.6×) | 7.92 | 20.13 (2.7×) | 2.51 (1.6×) | 82.6 |
| PVLS-35 | 0.312 (2.17×) | 36.5 (2.17×) | 7.92 | 10.57 (4.5×) | 3.4 (2.17×) | 81.1 |

Table 2 and Table 3 compare the delay, energy-delay product (EDP), energy efficiency (FPS/W) and accuracy of different PVDS and PVLS ViTs searched at different target delays lower than the baseline. As seen in Table 2 (Table 3), the PVDS-50 (PVLS-50) ViT achieves 1.73× (2.7×) EDP reduction and 1.23× (1.6×) higher FPS/W with merely 0.4% (0.2%) accuracy reduction compared to the baseline DeiT-S (LVViT-S). At a slightly higher accuracy reduction of 1.6% (1.7%), the PVDS-35 (PVLS-35) yields 2.6× (4.5×) lower EDP and 1.62× (2.17×) higher FPS/W compared to the baseline.
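The relationship between the table columns can be illustrated with a short calculation. The sketch below uses the textbook definitions (energy = power × delay, EDP = energy × delay, FPS/W = frames per second divided by power) and applies them to the DeiT-S baseline row of Table 2. The function names are hypothetical, and the sketch is not meant to reproduce every table entry, since the reported values are rounded and come from the simulator.

```python
def efficiency_metrics(delay_ms, power_w):
    """Derive energy, EDP and FPS/W from a per-image delay and an average power."""
    delay_s = delay_ms / 1000.0
    energy_j = power_w * delay_s               # J
    edp_j_ms = energy_j * delay_ms             # J x ms
    fps_per_watt = (1.0 / delay_s) / power_w   # frames per second per watt
    return {"energy_j": energy_j, "edp_j_ms": edp_j_ms, "fps_per_watt": fps_per_watt}


def improvement(baseline_value, optimized_value):
    """Reduction factor for a lower-is-better metric such as delay or EDP."""
    return baseline_value / optimized_value


# DeiT-S baseline row of Table 2: 59.66 ms at 7.92 W.
baseline = efficiency_metrics(delay_ms=59.66, power_w=7.92)
print(f"energy = {baseline['energy_j']:.2f} J")        # about 0.47 J
print(f"EDP    = {baseline['edp_j_ms']:.2f} J x ms")   # about 28.19 J x ms
print(f"FPS/W  = {baseline['fps_per_watt']:.2f}")      # about 2.12

# Delay reduction of PVDS-50 (48.47 ms) over the baseline, about 1.23x.
print(f"delay reduction = {improvement(59.66, 48.47):.2f}x")
```

Because FPS/W is the reciprocal of the energy per frame, an energy reduction at fixed power translates into an equal FPS/W gain.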

Figure 6. (a) Delay breakdown across encoder modules for different ViTs. (b) Energy breakdown across the PE Array, Periphery and SRAM (part of the Zynq MPSoC PL) and the PS (Zynq MPSoC PS).

Fig. 6a shows the delay distribution across the Attention MAC (QKV, QKT, SMxV and Proj), Softmax and MLP modules (refer to Fig. 1a). Interestingly, the softmax module consumes 60% (63%) of the overall delay in the DeiT-S (LVViT-S) ViT. With PIVOT, the softmax overhead reduces to 43% (48%) for the PVDS-50 (PVLS-50) ViT. Similarly, the Attention MAC overhead reduces to 13% (14%) in the PVDS-50 (PVLS-50) ViT compared to 18% (19%) in the DeiT-S (LVViT-S) ViT. Note that since PIVOT does not skip MLP modules, the delay overhead of the MLP in PVDS-50 (PVLS-50) increases by 21% (19%) compared to the baselines due to the re-computation overhead (refer to Section 3.1). However, due to the large delay reduction in the softmax and attention MAC modules, PIVOT achieves an overall delay reduction.

Energy Reduction across FPGA Resources: As seen in Fig. 6b, the delay reduction in the PVDS-50 and PVLS-50 ViTs leads to an energy reduction across the Zynq MPSoC PS and PL systems. The PVDS-50 and PVLS-50 ViTs achieve around 2× energy reduction in the PS and 1.6×, 1.7× and 1.8× energy reduction in the PE array, SRAM memories and peripheral circuits, respectively, implemented on the Zynq MPSoC PL (see Section 3.4). The peripheral circuits (periphery) include the PS-PL interconnects, reset and memory controllers.

4.3. Comparison with Prior Works

Table 4. Performance comparison of ViTCOD (You et al., 2023), HeatViT (Dong et al., 2023) and PVDS-50.

| Work | ViTCOD (You et al., 2023) | HeatViT (Dong et al., 2023) | PIVOT (Ours) |
|---|---|---|---|
| ViT Backbone | DeiT-S | DeiT-S | DeiT-S |
| Effort Modulation | Constant | Constant | Input-aware |
| Prediction Mechanism | Norm Score | Head Level | Entropy Metric |
| Quantization | 8-bits | 8-bits | 8-bits |
| Accuracy | 78.1% | 79.1% | 79.4% |
| GPP Compatible | × | × | ✓ |

Table 4 presents a holistic comparison between PIVOT and prior state-of-the-art algorithm-hardware co-design frameworks (Dong et al., 2023; You et al., 2023). Soft token pruning in HeatViT (Dong et al., 2023) achieves high token pruning ratios of 40%, 74% and 87% in encoders 4-6, 7-9 and 10-12, respectively, while achieving 79.1% accuracy. ViTCOD (You et al., 2023) achieves a 90% attention sparsity ratio at 78.1% accuracy. Accuracy advantage in PIVOT: HeatViT (Dong et al., 2023) and ViTCOD (You et al., 2023) do not modulate their efforts based on the input difficulty (token and attention sparsity ratios remain constant for all inputs). Therefore, at high token and attention pruning ratios, the accuracy suffers as difficult images are wrongly classified. In contrast, due to its input-awareness, PIVOT (PVDS-50) achieves the highest accuracy of 79.4%.

Figure 7. Compute and overhead delay breakdowns for the DeiT-S baseline, HeatViT (Dong et al., 2023), ViTCOD (You et al., 2023) and PIVOT (PVDS-50) across (a) Nvidia V100, Nvidia RTX2080ti and Nvidia Jetson Orin Nano and (b) Intel Xeon and Raspberry Pi 4.

Evaluation on GPPs: As HeatViT (Dong et al., 2023) and ViTCOD (You et al., 2023) require special hardware support for efficient implementation, we perform the delay comparison on GPPs, namely CPUs (Intel Xeon and Raspberry Pi 4) and GPUs (Nvidia V100, Nvidia RTX2080ti and Nvidia Jetson Orin Nano), for a fair comparison. As seen in Fig. 7a and Fig. 7b, PIVOT (PVDS-50) achieves around 1.2-1.5× lower delay compared to the baseline across all GPPs. Since ViTCOD requires sparse matrix multiplication support, its delay on GPPs is similar to the baseline. Due to hefty predictor networks and token packaging modules for soft token pruning, HeatViT (Dong et al., 2023) entails a significant delay overhead when implemented on GPPs. PIVOT is general purpose and entails a small delay overhead of 6%, which is mainly contributed by the re-computation overhead. The contribution of entropy computation (Equation 3) is negligibly small ($<0.05\%$).

4.4. Analysis with $LEC$ Constraints

Figure 8. (a) Effect of different $LEC$ values on the EDP and accuracy for different effort combinations. (b) EDP distribution between the low effort, high effort and the re-computation overhead (Overhead) for the PVDS-50 ViT.

From Fig. 8, we find that $LEC=70$ and $LEC=80$ attain the best EDP-accuracy tradeoff across different PVDS ViTs. At a low $LEC=60$, the EDP is high as merely 60% of the inputs are classified by the low effort. Conversely, $LEC=90$ entails 90% of the inferences being performed with the low effort, but this leads to a significant accuracy degradation.
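One simple way to hit a target $LEC$ is to calibrate a threshold on the per-input entropy so that the desired percentage of held-out inputs falls at or below it. The sketch below illustrates only this idea. It is an assumption about how such a constraint could be enforced, not a description of PIVOT's search, and the function name and the sample entropies are hypothetical.

```python
def threshold_for_lec(entropies, lec_percent):
    """Entropy threshold such that about lec_percent% of samples are at or below it."""
    if not entropies:
        raise ValueError("need at least one entropy value")
    if not 0 < lec_percent <= 100:
        raise ValueError("lec_percent must lie in (0, 100]")
    ordered = sorted(entropies)
    # Number of samples that should be handled by the low effort.
    count_low = max(1, round(len(ordered) * lec_percent / 100))
    return ordered[count_low - 1]


# Made-up calibration entropies for ten inputs.
sample = [0.05, 0.10, 0.20, 0.35, 0.50, 0.80, 1.10, 1.40, 1.90, 2.30]
th = threshold_for_lec(sample, lec_percent=70)
low_effort_share = sum(e <= th for e in sample) / len(sample)
print(f"threshold = {th}, low-effort share = {low_effort_share:.0%}")
```

With tied entropy values the realized share can slightly exceed the target, so a real calibration would need to check the achieved percentage.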

The EDP comprises contributions from the low effort inference, the high effort inference and the re-computation overhead (Section 3.1). At low $LEC$ values, both the high effort and re-computation EDPs are high while the low effort EDP is low. As the $LEC$ value increases, the low effort EDP increases marginally while the high effort and re-computation EDPs reduce significantly, leading to a low overall EDP.

Need for Input Difficulty Awareness: As seen in Fig. 8a, for $LEC=100$, all inputs are inferred by the low effort. This leads to a low EDP at the cost of accuracy since the efforts are not modulated for difficult inputs. Therefore, PIVOT’s input-aware effort modulation achieves optimal accuracy-efficiency tradeoffs.

4.5. Effort Combinations for Different Delays

Figure 9. Different PVDS ViTs sampled by PIVOT at different delay constraints.

As seen in Fig. 9, a reduction in the delay requirement lowers the number of active attentions in the ViT. The efforts shown here represent the optimal path with the highest Path-score for each effort. Interestingly, we observe that across all efforts, attention skipping is preferred in the deeper layers as the $CKA(MLP,A)$ value is higher in the later layers.
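As a reminder of what a CKA score measures, the sketch below computes centered kernel alignment with a linear kernel between two sets of features collected over the same inputs. The linear-kernel choice, the function names and the toy feature values are assumptions made for illustration. The similarity measure PIVOT actually uses is the one defined earlier in the paper.

```python
import math


def gram(features):
    """Linear-kernel Gram matrix, with one row of `features` per input sample."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features] for u in features]


def center(k):
    """Double-center a square matrix (equivalent to H K H with H = I - 11^T / n)."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    col_mean = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[k[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def inner(k, l):
    """Element-wise (Frobenius) inner product of two equally sized matrices."""
    return sum(a * b for rk, rl in zip(k, l) for a, b in zip(rk, rl))


def linear_cka(x, y):
    """Linear CKA between feature sets x and y (same number of samples)."""
    kc, lc = center(gram(x)), center(gram(y))
    return inner(kc, lc) / math.sqrt(inner(kc, kc) * inner(lc, lc))


# Toy features for four inputs (made-up values).
mlp_out = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
attn_out = [[0.9, 2.1], [2.2, 0.8], [3.1, 4.7], [3.8, 3.3]]
print(f"CKA(MLP, A) = {linear_cka(mlp_out, attn_out):.3f}")
```

A score close to 1 indicates that the two feature sets are highly similar, and the score does not change when either feature set is uniformly rescaled.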

5. Conclusion

PIVOT motivates ViT attention optimization in an input difficulty-aware manner. PIVOT’s input-awareness yields 0.4%-1.3% higher accuracy compared to prior token pruning and attention sparsification works. Unlike prior works, PIVOT is GPP compatible and yields 1.2-1.5× higher throughput compared to the baseline ViT across different CPU/GPU platforms. Additionally, we develop PIVOT-Sim, an end-to-end open-source FPGA-based evaluation platform that will motivate future ViT-hardware co-optimization works.

Acknowledgement

This work was supported in part by CoCoSys, a JUMP2.0 center sponsored by DARPA and SRC, the National Science Foundation (CAREER Award, Grant #2312366, Grant #2318152), and the DoE MMICC center SEA-CROGS (Award #DE-SC0023198).

References

  • Bhattacharjee et al. (2022) Bhattacharjee et al. 2022. MIME: adapting a single neural network for multi-task inference with memory-efficient dynamic pruning. In Proceedings of the 59th ACM/IEEE Design Automation Conference. 499–504.
  • Cortes et al. (2012) Cortes et al. 2012. Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research 13, 1 (2012), 795–828.
  • Dehghani et al. (2023) Dehghani et al. 2023. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning. PMLR, 7480–7512.
  • Dong et al. (2023) Dong et al. 2023. HeatViT: Hardware-efficient adaptive token pruning for vision transformers. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 442–455.
  • Dosovitskiy et al. (2020) Dosovitskiy et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • Han et al. (2022) Han et al. 2022. A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2022), 87–110.
  • Jiang et al. (2021) Jiang et al. 2021. All tokens matter: Token labeling for training better vision transformers. Advances in Neural Information Processing Systems 34 (2021), 18590–18602.
  • Kim et al. (2021) Kim et al. 2021. Rethinking the self-attention in vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3071–3075.
  • Li et al. (2023) Li et al. 2023. Input-aware dynamic timestep spiking neural networks for efficient in-memory computing. In 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
  • Panda et al. (2016) Panda et al. 2016. Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 475–480.
  • Rao et al. (2021) Rao et al. 2021. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems 34 (2021), 13937–13949.
  • Samajdar et al. (2018) Samajdar et al. 2018. SCALE-Sim: Systolic CNN accelerator simulator. arXiv preprint arXiv:1811.02883 (2018).
  • Stevens et al. (2021) Stevens et al. 2021. Softermax: Hardware/software co-design of an efficient softmax for transformers. In 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 469–474.
  • Touvron et al. (2021) Touvron et al. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. PMLR, 10347–10357.
  • Wang et al. (2021) Wang et al. 2021. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 97–110.
  • Wu et al. (2018) Wu et al. 2018. BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8817–8826.
  • You et al. (2023) You et al. 2023. ViTCOD: Vision transformer acceleration via dedicated algorithm and accelerator co-design. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 273–286.