License: CC BY 4.0
arXiv:2403.09026v2 [cs.AR] 11 Apr 2024

FlexNN: A Dataflow-aware Flexible Deep Learning Accelerator for Energy-Efficient Edge Devices

Arnab Raha, Deepak A. Mathaikutty, Soumendu K. Ghosh and Shamik Kundu*
Advanced Architecture Research, NPU IP, CGAI (CCG), Intel Corporation, Santa Clara, CA, USA
Email: {arnab.raha, deepak.a.mathaikutty, soumendu.ghosh, shamik.kundu}@intel.com
*Shamik Kundu contributed to this work during his internship in the Advanced Architecture Research Group during summers ’22 and ’23. He is currently a PhD student at UT Dallas. (Email: [email protected])
Abstract

This paper introduces FlexNN, a Flexible Neural Network accelerator, which adopts agile design principles to enable versatile dataflows, enhancing energy efficiency. Unlike conventional convolutional neural network accelerator architectures that adhere to fixed dataflows (such as input, weight, output, or row stationary) for transferring activations and weights between storage and compute units, our design enables adaptable dataflows of any type through software-configurable descriptors. Considering that data movement costs considerably outweigh compute costs from an energy perspective, the flexibility in dataflow allows us to optimize the movement per layer for minimal data transfer and energy consumption, a capability unattainable in fixed dataflow architectures. To further enhance throughput and reduce energy consumption in the FlexNN architecture, we propose a novel sparsity-based acceleration logic that utilizes fine-grained sparsity in both the activation and weight tensors to bypass redundant computations, thus optimizing the convolution engine within the hardware accelerator. Extensive experimental results underscore a significant enhancement in the performance and energy efficiency of FlexNN relative to existing DNN accelerators.

Index Terms:
Deep neural network accelerator, flexible data flow, sparsity acceleration, energy efficiency.

I Introduction

The landscape of machine learning is experiencing an unprecedented surge, with a multitude of artificial intelligence (AI) networks proposed along with the development of numerous hardware platforms dedicated to accelerating Deep Neural Network (DNN) inference tasks. As the field progresses, the complexity of DNNs continues to grow, resulting in handling of large amounts of tensor data that exhibit diverse shapes and dimensions across different layers of existing networks. Moreover, with the continuous introduction of new networks, the dimensions of these tensor data are in constant flux. Consequently, there is a pressing need to engineer hardware accelerators with the flexibility to efficiently process network layers of varying dimensions [1, 2].

Furthermore, the proliferation of edge devices, including wearables, smart cameras, smartphones, and surveillance platforms, underscores the importance of energy efficiency in the design of DNN accelerators [3]. Given that tensor data processing involves traversing multiple levels of memory hierarchy, minimizing data transfer while maximizing data reuse and resource utilization emerge as critical imperatives to improve the energy efficiency of DNN accelerators [4].

However, prevailing accelerators for DNN execution, such as Eyeriss [5] and TPU [6], typically adopt custom memory hierarchies and fixed dataflows. These architectures dictate the sequence in which the tensor data for the activations and weights are moved to processing units to execute the tensor operations for each layer of the network. Although effective, these approaches may not fully exploit the potential for energy efficiency optimization inherent in more flexible hardware designs. Therefore, there is a growing interest in exploring novel architectures and strategies that can better adapt to the evolving demands of DNN inference while simultaneously improving energy efficiency across a range of edge devices and AI applications.

The energy consumption for each layer in DNN inference is heavily influenced by the movement of data across the memory hierarchy and the level of reuse within the processing units. Previous studies have endeavored to characterize energy efficiency through analytical models while stressing the importance of enabling flexibility in scheduling tensors of varying dimensions [7]. This flexibility involves optimizing the ordering, blocking, and partitioning of tensors to maximize reuse from the innermost memory hierarchy, where the energy cost per unit of data moved is minimized. However, most existing DNNs, such as ResNet, YOLO, VGG, and GoogLeNet, comprise tens to hundreds of layers, each with different preferences for scheduling to achieve energy optimality. Fixed-schedule DNN accelerators can only offer optimal data reuse and resource utilization for a subset of DNN layers, thus limiting overall energy efficiency. Moreover, these accelerators exhibit strong network dependencies, which poses challenges in adapting to the rapidly evolving landscape of DNNs. Existing DNN accelerator designs from both industry and academia predominantly employ fixed schedules, such as input stationary (IS), weight stationary (WS), output stationary (OS), non-local reuse (NR), and row stationary (RS) [8, 9]. The fixed dataflow characteristic of these accelerators originates from their tensor data distribution modules, which perform addressing to on-die storage, data transfer to processing engine arrays, and data storage to SRAM banks in a predetermined manner. As a result, these accelerators lack the flexibility to implement different schedules (i.e., dataflows). Although software solutions on general-purpose CPUs and GPUs can reshape and load tensor data, fixed-function accelerators do not support flexibility. FPGAs, although offering flexibility, cannot alter the hardware configuration during execution from one layer to another.

In contrast to previous approaches, this paper proposes a dataflow-aware flexible DNN accelerator that leverages schedule information from DNN layers to adapt tensor data shape and internal compute configuration per layer. This enables the compiler to configure the DNN accelerator optimally for handling tensor operations based on tensor dimensions. The key advantage of our proposed accelerator design lies in its ability to switch among multiple schedules based on layer characteristics, thereby minimizing memory accesses for a given tensor operation and resulting in significant energy savings at the accelerator level.

To further enhance performance and increase energy efficiency in the accelerator, we capitalize on the inherent sparsity in DNNs. Due to the nature of DNNs, weights associated with the network are often “sparse,” which means that they contain a significant number of zeros generated during the training phase [10, 11]. These zero-valued weights do not contribute to the accumulation of partial sums during multiply-and-accumulate (MAC) operations. Additionally, highly sparse weights cause activations to become sparse in subsequent layers of the DNN after passing through non-linear activation functions like ReLU. Furthermore, network quantization (INT8/INT4) for edge device inference also results in a high number of zeros in both weights and activations. This fine-grained unstructured sparsity in weights and activations offers potential for improved energy efficiency and processing speed in two ways: (1) MAC computation can be gated or skipped, and (2) weights and activations can be compressed to reduce storage and data movement. The former reduces energy consumption, while the latter reduces both energy consumption and processing cycles. However, designing DNN accelerators to harness these benefits from sparsity is challenging due to irregular access patterns, workload imbalances, and under-utilization of MAC-based processing elements [12]. Hence, in this paper, we develop a novel sparsity acceleration logic capable of skipping computation of zero-valued compressed data while simultaneously identifying non-zero elements in both activation and weight tensors. This will facilitate the implementation of an efficient convolution engine in the hardware accelerator at the edge, enabling efficient utilization of resources and enhancing overall performance and energy efficiency.

In this paper, we introduce FlexNN, a Flexible Neural Network accelerator, designed with agile principles to support versatile dataflows, thereby enhancing energy efficiency. Recognizing that data movement costs significantly outweigh compute costs in terms of energy consumption, the flexibility in dataflow enables us to optimize data transfer per layer, leading to minimal data movement and reduced energy consumption, an advantage not achievable in fixed dataflow architectures. To further boost throughput and reduce energy consumption within the FlexNN architecture, we propose an innovative sparsity-based acceleration logic. This logic harnesses fine-grained sparsity in both activation and weight tensors to bypass redundant computations, effectively optimizing the convolution engine within the hardware accelerator. In summary, this paper makes the following contributions.

  • This paper introduces a novel DNN accelerator, FlexNN, designed to be sensitive to dataflow, offering flexibility by integrating DNN layer scheduling insights. By dynamically adjusting tensor data shape and internal compute configuration for each layer, the accelerator allows the compiler to optimize its performance in handling tensor operations, tailoring its configurations based on tensor dimensions for diverse neural network architectures.

  • We introduce a novel sparsity acceleration logic that capitalizes on the unstructured fine-grained sparsity present in incoming activations and weights, thereby expediting inference execution within the DNN accelerator. Data are maintained in a zero-compressed format to mitigate storage and data movement expenses. Weights and activations are mapped while considering sparsity to enhance reuse, thereby enhancing overall performance.

  • Extensive experimental evaluations conducted on six distinct DNNs that span both image classification and object detection tasks highlight the transformative impact of our accelerator. Specifically, our architecture showcases substantial improvements over fixed-schedule accelerators for ResNet101 and YOLOv2, demonstrating up to 77% and 62% energy reduction over Eyeriss and TPU, respectively. Furthermore, our accelerator achieves notable sparsity improvements for four additional DNNs, namely ResNet50, MobileNetV2, GoogLeNet, and InceptionV3. Across these benchmarks, FlexNN achieves a speedup of 1.8× to 3.3× over dense accelerators and 1.7× to 2.0× over semi-sparse accelerators with weight-sparsity support. Sparsity support also provides 1.7× to 3.0× and 1.6× to 1.8× improvement in energy efficiency compared to dense and weight-sparse accelerators, respectively. These results underscore the profound impact of our accelerator in enabling efficient execution of sparse and compact DNNs, significantly enhancing both speed and energy consumption metrics.

The remainder of the paper is organized as follows. Section II delineates the need for flexible dataflow and efficient two-sided sparsity acceleration logic. Section III describes the microarchitectural details of the proposed FlexNN accelerator. Section IV presents the experimental setup followed by the results in Section V. The prior art in this domain is described in Section VI. Finally, Section VII concludes the paper.

II Motivation

In this section, we delve into the fundamental motivations driving the design and development of our accelerator architecture, focusing on two key aspects: the paramount importance of flexibility and the critical need for efficient sparsity acceleration. By addressing these critical considerations, our accelerator aims to revolutionize the landscape of deep learning (DL) hardware, offering unparalleled versatility and performance across a wide range of DNNs and applications. We explore how these foundational principles drive innovation and shape the architectural decisions that lead to the design of FlexNN.

II-A Importance of flexibility

Figure 1: Illustration of multi-loop tensor processing during convolution operation in DNN.

Numerous DNN accelerators utilize spatial architectures comprised of arrays of processing elements (PEs) alongside local storage, such as register files (RFs), for those PEs, and external storage, such as SRAM banks. In inference tasks, trained weights, or filters (FL), must be loaded into PE arrays from storage sources such as DRAMs and SRAM buffers. Input images, referred to as input activations or features (IFs), are also transferred to PE arrays, where MAC operations occur across multiple input channels (ICs) between activations and weights, generating output activations or features (OFs). Multiple sets of weight tensors (OCs) are commonly used against a specific set of activations to produce an output tensor volume. Finally, a non-linear function (e.g., ReLU) is applied to the output activations, which then become the input activations for the subsequent layer. The tensor processing involved in a convolution operation, shown in Fig. 1, comprises six nested loops. A convolution layer generates an output tensor, the OF map, from multiple kernel feature maps, the FLs, operating on one or more input tensors, the IF map. Each point in the output volume undergoes a MAC operation during the calculation. For instance, a 1×1 convolution layer, such as the second convolution layer in ResNet50, has IF map dimensions of IX = 56, IY = 56, IC = 64, and filter dimensions of FX = 1, FY = 1, IC = 64, OC = 256. These dimensions convolve (with a batch size of 1) to produce an OF map with dimensions OX = 56, OY = 56, OC = 256, accompanied by appropriate padding values.
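As an illustration of the six nested loops, the following is a minimal software sketch of a direct convolution, assuming stride 1, no padding, and a batch size of 1. The function names `conv2d_six_loops` and `mac_count` are hypothetical helpers introduced here for exposition; they do not describe the FlexNN hardware.

```python
def conv2d_six_loops(ifmap, filters):
    """Direct convolution as six nested loops (assumes stride 1, no padding).

    ifmap[ic][iy][ix]       : input feature map (IF)
    filters[oc][ic][fy][fx] : weights (FL)
    returns ofmap[oc][oy][ox]
    """
    IC, IY, IX = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    OC, FY, FX = len(filters), len(filters[0][0]), len(filters[0][0][0])
    OY, OX = IY - FY + 1, IX - FX + 1
    ofmap = [[[0] * OX for _ in range(OY)] for _ in range(OC)]
    for oc in range(OC):                      # output channels
        for oy in range(OY):                  # output rows
            for ox in range(OX):              # output columns
                for ic in range(IC):          # input channels
                    for fy in range(FY):      # filter rows
                        for fx in range(FX):  # filter columns
                            ofmap[oc][oy][ox] += (
                                ifmap[ic][oy + fy][ox + fx] * filters[oc][ic][fy][fx]
                            )
    return ofmap


def mac_count(OX, OY, OC, IC, FX, FY):
    """Number of MAC operations: one per iteration of the innermost loop."""
    return OX * OY * OC * IC * FX * FY
```

Under these assumptions, the 1×1 ResNet50 layer quoted above (OX = OY = 56, OC = 256, IC = 64, FX = FY = 1) corresponds to `mac_count(56, 56, 256, 64, 1, 1)`, that is, 56 · 56 · 256 · 64 MAC operations.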

The dimensions of the input tensor undergo changes as they transition from one layer to another within a DNN and across various DNNs. Consequently, the development of flexible hardware accelerators becomes crucial to maintaining high utilization of compute units across network layers with arbitrary dimensions. Attempting to map various tensor dimensions to a fixed PE array with a consistent tensor mapping pattern can lead to decreased array utilization. To improve performance and energy efficiency, it is imperative to minimize data movement by maximizing data reuse from local memory and improving resource utilization. This optimization is particularly vital, as the cost of memory accesses often exceeds that of computing, as illustrated in Fig. 2. Numerous existing DNN accelerators, such as Eyeriss [12], TPU [13], and SCNN [10], implement novel memory hierarchies and fixed dataflows, influencing the movement of tensors for activations and weights within the processing units and the workload assigned to each PE. A fixed dataflow constrains the types of data movement across the memory hierarchy, limiting the degree of reuse within processing units. The movement of IFs, FLs, and partial sums (psums), as well as the order of reuse, directly impact the energy consumption for each layer. In the literature [7], inference accelerators are classified into IS, WS, OS, and RS based on dataflow. The data reuse scheme is based on loop order, loop blocking, and loop partitioning for tensor processing, collectively called a “schedule”, as depicted in Fig. 3. This schedule is described in relation to the dimensions of the tensors in a convolutional neural network. The loop order dictates the relative order of IX, IY (spatial), and IC dimensions for activations, and FX, FY, IC, OC dimensions for filters when loading these data into the accelerator. Loop partitioning dictates how the overall convolution operation is distributed among the PEs in the PE array, whereas loop blocking governs the allocation of multiple points in each dimension to the same PE.
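The three ingredients of a schedule can be sketched in software for a 1×1 convolution. The example below is only an illustration under simplifying assumptions: the OC dimension is partitioned across a hypothetical number of PEs (`num_pes`), the IC dimension is blocked by a hypothetical factor (`ic_block`), and the loop order places IC blocks outside the spatial points. None of these names or choices come from the FlexNN design; the point is that any such schedule computes the same result as the reference loop nest.

```python
def pointwise_reference(ifmap, weights):
    """Reference 1x1 convolution: ifmap[p][ic] over spatial points p, weights[oc][ic]."""
    return [[sum(a * w for a, w in zip(point, row)) for row in weights] for point in ifmap]


def pointwise_scheduled(ifmap, weights, num_pes, ic_block):
    """Same computation, restructured by a toy schedule (illustrative only)."""
    OC, IC = len(weights), len(weights[0])
    out = [[0] * OC for _ in ifmap]
    for pe in range(num_pes):                    # loop partitioning: OCs split across PEs
        for oc in range(pe, OC, num_pes):
            for ic0 in range(0, IC, ic_block):   # loop order: IC blocks outside spatial points
                # loop blocking: ic_block IC points are handled together by the same PE
                w_block = weights[oc][ic0:ic0 + ic_block]
                for p, point in enumerate(ifmap):
                    a_block = point[ic0:ic0 + ic_block]
                    out[p][oc] += sum(a * w for a, w in zip(a_block, w_block))
    return out
```

Changing `num_pes` or `ic_block` changes which data each PE touches and in what order, but not the output; this is the degree of freedom a flexible-schedule accelerator exploits.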

Figure 2: Relative energy costs of different compute and memory operations at various precisions in 45 nm technology [14]. Note that the x-axis is on a logarithmic scale.
Figure 3: Illustration of (1) loop order, (2) loop partitioning and (3) loop blocking - referred to collectively as schedule - for optimizing data loading and distribution in the accelerator.

All existing inference engines operate with fixed loop orders, blocking, and partitioning for convolution operations. Consequently, each accelerator can execute only one predetermined dataflow, where the data remain stationary in a single aspect. Various schedules require that IFs, FLs, and OFs/psums be mapped and accessed from local RF storage differently, depending on the type of schedule being computed. For example, in the IS scenario, a single point within the IF RF must undergo multiplication and accumulation against multiple points in the FL RF. The frequency of this repetition of operations varies on the basis of the schedule. Similarly, in the WS situation, a single point within the FL RF must be multiplied and accumulated against multiple points in the IF RF. Lastly, for OS schedules, the same psum in the OF RF must be retained and used to accumulate the results of multiplication between distinct IF and FL RF points over multiple cycles. Furthermore, when the size of the SRAM imposes limitations on the number of IC points that can be stored, incomplete OF points in the form of psums must be transferred to a higher-level memory hierarchy (e.g., DRAM) for subsequent retrieval into PE RFs to complete OF computation across all ICs.
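The difference between the IS, WS, and OS cases can be made concrete with a deliberately simplified toy model. The sketch below assumes a single PE that holds exactly one IF value, one FL value, and one psum at a time, and counts how often each register must be reloaded for a 1×1 convolution under a given loop order. The function `register_loads` and the dimension names `p` (spatial point), `ic`, and `oc` are hypothetical; real RFs hold many values, so the absolute numbers are not meaningful, only the relative pattern.

```python
from itertools import product


def register_loads(order, sizes):
    """Count operand (re)loads for one PE holding a single IF, FL and OF/psum value.

    order : loop order, outermost dimension first, e.g. ("oc", "ic", "p")
    sizes : extent of each dimension, e.g. {"p": 4, "ic": 3, "oc": 2}
    """
    # Which loop indices identify each operand in a 1x1 convolution.
    operand_dims = {"IF": ("p", "ic"), "FL": ("oc", "ic"), "OF": ("p", "oc")}
    held = {name: None for name in operand_dims}
    loads = {name: 0 for name in operand_dims}
    for idx in product(*(range(sizes[d]) for d in order)):
        point = dict(zip(order, idx))
        for name, dims in operand_dims.items():
            tag = tuple(point[d] for d in dims)
            if held[name] != tag:      # a different element is needed: reload the register
                held[name] = tag
                loads[name] += 1
    return loads
```

With `sizes = {"p": 4, "ic": 3, "oc": 2}`, the order `("oc", "ic", "p")` keeps each weight in place while all points stream past (weight stationary), `("p", "ic", "oc")` does the same for each activation (input stationary), and `("p", "oc", "ic")` keeps each psum in place until it is complete (output stationary).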

Previous research aimed at characterizing the energy efficiency of DNN accelerators by constructing analytical models underscores the need to introduce flexibility in scheduling tensor operations of various dimensions to maximize reuse from the innermost memory hierarchy, where the energy cost per unit of data moved is minimized [7]. However, as mentioned in Section I, fixed dataflows can only cater to optimal data reuse and resource utilization for a limited subset of DNN layers. To address this flexibility challenge, the proposed tensor data computing PE array offers a practical solution with minimal hardware overhead. Realizing a flexible dataflow accelerator requires a dataflow-aware tensor distribution unit capable of utilizing layer-specific optimal schedules and dataflow information to distribute data to the array. Moreover, the accelerator should inherently support flexible mapping and execution of these data within each PE.

Motivation 1: It is important to develop a flexible dataflow in order to minimize data movement and maximize reuse in the PE array.


II-B Importance of Sparsity Acceleration

Sparse IFs are inherent in DNNs due to several factors. One primary cause is the prevalent use of ReLU as an activation function within many DNN architectures. The nature of ReLU to set negative values to zero contributes to sparsity, particularly intensifying in deeper layers, where it often exceeds 90%. In addition, the rise of auto-encoders, generative adversarial networks (GANs), and transformers further accentuates sparsity trends. These networks have decoder layers, employing zero-insertion techniques to up-sample input feature maps, resulting in more than 75% zeros. Furthermore, extensive efforts have focused on inducing FL sparsity within DNNs. Various criteria, such as saliency, magnitude, and energy consumption, are used to determine which weights to prune. As a result, pruned networks exhibit weight sparsity levels of up to 90% [15].

The translation of the sparsity in weights and activations into enhanced energy efficiency and processing speed presents a significant opportunity. However, designing DNN accelerators capable of effectively harnessing these characteristics remains a formidable challenge. Computation gating emerges as a promising technique for converting sparsity in both IFs and FLs into energy savings. The implementation involves recognizing whether either the weight or activation is zero and clock-gating the datapath switching and memory accesses accordingly, achieving cost-effective solutions. To optimize throughput while conserving energy, skipping cycles of processing MACs with zero weights or activations becomes desirable. Yet, this necessitates intricate read logic to locate the next non-zero value without expending cycles on zeros. A natural solution entails maintaining FLs and IFs in a compressed format indicating the next non-zero location relative to the current one. However, compressed formats, often of variable length, pose challenges for parallel processing across PEs without compromising compression efficiency. Additionally, simultaneous recognition of sparsity in both weights and activations complicates matters, as efficiently ‘looking ahead’ (e.g., skipping non-zero weights when the corresponding activation is zero) proves challenging with many compression formats. The irregularity introduced by such jumps precludes the use of pre-fetching to enhance throughput. Consequently, the control logic for processing compressed data becomes complex, adding overhead to the PEs. Addressing these complexities is vital to realize the full potential of sparsity in DNN accelerators.
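The distinction between gating and skipping can be sketched with an idealized count, assuming one MAC slot per cycle and ignoring all control overhead. The function `cycles_and_macs` is a hypothetical helper for illustration only; it is not a model of any particular accelerator.

```python
def cycles_and_macs(acts, wts):
    """Idealized cycle and MAC counts for one PE processing aligned operand pairs.

    Assumes one operand pair per cycle and no control overhead (illustrative only).
    """
    pairs = list(zip(acts, wts))
    # A MAC is effectual only when both the activation and the weight are non-zero.
    effectual = sum(1 for a, w in pairs if a != 0 and w != 0)
    return {
        "dense":    {"cycles": len(pairs), "macs": len(pairs)},
        "gating":   {"cycles": len(pairs), "macs": effectual},  # zeros still occupy a cycle
        "skipping": {"cycles": effectual,  "macs": effectual},  # zeros are never issued
    }
```

In this idealized view, gating reduces the number of MACs that actually switch (an energy benefit) but leaves the cycle count unchanged, whereas skipping reduces both.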

As a result, hardware solutions in this domain have been limited. For example, Cnvlutin [16] exclusively facilitates skipping cycles for activations without compressing weights, while Cambricon-X [17] lacks the ability to maintain activations in compressed format. Given the intricacies involved in skipping cycles for both weights and activations, existing hardware designed for sparse processing tends to be tailored to specific layer types. For example, EIE [18] is tailored for fully connected (FC) layers, while SCNN [10] is optimized for convolutional (CONV) layers. This specialization underscores the need for further innovation in developing versatile hardware architectures capable of efficiently handling sparsity across various layer types in diverse DNNs.

Introducing computation skipping for sparse data fundamentally alters the workload distribution across PEs, as the workload at each PE becomes contingent on sparsity levels. As the count of non-zero values fluctuates across diverse layers, data types, or even within specific regions within the same filter or feature map, it engenders an inherent imbalance in workload distribution across PEs [12]. Consequently, the throughput of the entire DNN accelerator becomes constrained by the PE processing the highest number of non-zero MAC operations. This imbalance inevitably results in reduced PE utilization, thereby impeding the overall efficiency and performance of the DNN accelerator. Addressing this challenge requires innovative strategies to optimize workload distribution and improve PE utilization, thus maximizing the potential benefits of computation skipping for sparse data.
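The effect of this imbalance can be sketched numerically. The helper below is hypothetical and assumes that every PE issues one non-zero MAC per cycle and that all PEs must finish before the next round begins, so the round lasts as long as the most heavily loaded PE.

```python
def round_cycles_and_utilization(nonzero_macs_per_pe):
    """Cycles for one round and average PE utilization under computation skipping.

    Assumes one MAC per PE per cycle and a barrier at the end of the round
    (an idealization for illustration).
    """
    longest = max(nonzero_macs_per_pe)   # the slowest PE sets the round length
    if longest == 0:
        return 0, 1.0                    # nothing to compute, so no PE waits on another
    busy = sum(nonzero_macs_per_pe)
    utilization = busy / (longest * len(nonzero_macs_per_pe))
    return longest, utilization
```

For example, four PEs with 10, 40, 20, and 10 non-zero MACs would take 40 cycles at 50% average utilization under these assumptions, even though a perfectly balanced assignment would need only 20 cycles.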

Motivation 2: It is important to develop an acceleration logic that can leverage both unstructured IF and FL sparsity, in zero-compressed format.


III Microarchitecture Design

Figure 4: Top-level schematic of FlexNN accelerator: (1) Interconnect pathways within PE array arranged in N columns, data distribution network, control block, and SRAM. (2) Architectural intricacies of each PE showing data storage (IF/FL/OF RF), sparsity storage (IF/FL SP), and MAC unit.

This section delineates the array of microarchitecture design decisions and techniques essential for implementing the proposed FlexNN accelerator.

III-A Overview of FlexNN accelerator

The high-level diagram of the DNN accelerator is shown in Fig. 4, illustrating the various microarchitectural components that facilitate flexible and reconfigurable dataflow. Although the proposed DNN accelerator accommodates any schedule, it remains preferable to configure it according to the optimal schedule for individual layers of the neural network. The optimal schedule is obtained per layer, according to existing research [19, 20]. Leveraging the regularity of DNN computations facilitates efficient data loading into the accelerator and enables flexible convolution mapping based on the optimal schedule. In our design, we assume a three-level memory hierarchy. The first level comprises internal RFs within each PE. The second level consists of SRAM, similar to a small L1 cache, which stores input operands, output points, and partial sums. The third level encompasses DRAM memory, which is required because of its high capacity to store large amounts of filter weights and spilled OF points. For brevity, we omit DRAM-level implementation specifics and assume that the SRAM’s capacity is sufficient to accommodate all input, intermediate activations, and filter weights. Each memory level gives us the opportunity to reuse data, thus enhancing energy efficiency. The flexible DNN accelerator, capable of supporting INT8, U8, FP16, and BF16 datatypes, comprises three principal components, elaborated in subsequent sections.

III-B Versatile Processing Element and Flexible Processing Element Array

The Versatile Processing Element (VPE) serves as the fundamental computational unit within the proposed FlexNN accelerator, primarily tasked with performing MAC operations between IF and FL points [21, 22, 23]. VPE also facilitates the accumulation of internal/external psums. In the context of the flexible accelerator, the VPE optimizes the reuse of IF/FL/OF data and selects the most suitable compute template based on the optimal schedule for each layer. The Flexible Processing Element Array (FPA) comprises an N×M array of VPEs, with the array dimension parameterized by design, typically through synthesis parameters. This array can be conceptualized as being arranged into M columns, each column consisting of N VPEs. To streamline the control logic, we deliberately adopted a square grid configuration (N×N) for the FPA, using N = 16, simplifying the associated control mechanisms. Fig. 4.1 illustrates the schematic of the DNN FPA, composed of an array of VPEs that serve as the most important computational unit within the accelerator. Although this figure shows an N×N control block for simplicity, each PE column is instantiated with one control unit consisting of N control blocks, where each control block is dedicated to 1 PE. Each VPE, as demonstrated in Fig. 4.2, features 4 sets of four 4R1W IF compressed data (CD) RFs to store input features ($IF_0RF$ to $IF_3RF$), 1R1W FL CD RF to store weights ($FL_0RF$ to $FL_3RF$), and 1R1W OF RF ($OF_0RF$ to $OF_3RF$) to store output values (OF/psum). In addition, each VPE consists of 4 sets of 1R1W RFs to store sparsity bitmaps (SP BMP), namely IF SP BMP RF and FL SP BMP RF. During a typical MAC operation, the input operands are fetched from the IF and FL RFs, based on the stored bitmaps, the sparsity acceleration logic (described later in Section III-D), and the addresses generated by control and address generation (CAG) unit. The operation output is accumulated within the OF RF. Note that for stall-free high-performance execution, we introduced double buffers (active + shadow) in IF, FL, and OF RFs.

Fig. 5 shows the microarchitecture details of VPE that execute computations on IF, FL, and OF/psum tensor data based on the optimal schedule of the current layer. VPE dynamically adjusts the loading and access patterns of the IF, FL, and OF/psum tensor data within the PE RFs to maximize reuse of the tensor data. The PE’s microarchitecture is crafted to effectively utilize sparsity within both IF and FL. As illustrated in the figure, the PE comprises registers dedicated to storing sparsity data from incoming IF and FL streams, represented as bitmaps (IF SP BMP RF and FL SP BMP RF). These bitmaps are merged using a sparsity acceleration logic (further elaborated in Section III-D), resulting in a combined sparsity bitmap (CSB). This unified bitmap serves as input to the CAG unit, facilitating the generation of participating non-zero IF and FL read addresses. N control blocks in each PE column update the configuration descriptors inside individual PEs at the onset of each convolution layer based on the optimal layer schedule, guiding data redirection during load, compute, and drain operations throughout the lifetime of the input tensor data. The PE finite-state machine (FSM) guides several internal counters and logic in the CAG unit to generate read and write control signals for IF, FL, and OF RFs, along with multiplexer control signals to route data from the RFs to the appropriate arithmetic units based on the template, viz., vector-vector (V×V) or matrix-matrix (M×M) or operation type, viz., MAC/Eltwise/Pooling and tensor dimension.
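The role of the combined sparsity bitmap can be sketched in software. The example below is a minimal illustration, assuming a zero-compressed format that stores a one-bit-per-element bitmap alongside the packed non-zero values, and assuming the CSB is the bitwise AND of the IF and FL bitmaps. The functions `compress` and `sparse_dot` are hypothetical; the actual sparsity acceleration logic is described in Section III-D and may differ.

```python
def compress(values):
    """Zero-compress a vector into (bitmap, packed non-zero values)."""
    bitmap = [1 if v != 0 else 0 for v in values]
    packed = [v for v in values if v != 0]
    return bitmap, packed


def sparse_dot(if_bitmap, if_packed, fl_bitmap, fl_packed):
    """Dot product over compressed operands, issuing only effectual MACs.

    Returns (psum, number of MACs issued). The software loop visits every
    bitmap position; hardware would instead jump directly between set CSB bits.
    """
    csb = [a & b for a, b in zip(if_bitmap, fl_bitmap)]  # combined sparsity bitmap
    if_addr = fl_addr = 0   # read addresses into the packed (compressed) storage
    psum = 0
    issued = 0
    for i, both_nonzero in enumerate(csb):
        if both_nonzero:
            psum += if_packed[if_addr] * fl_packed[fl_addr]
            issued += 1
        # Each address advances by one whenever its own bitmap has a set bit,
        # i.e. the address equals the count of set bits before position i.
        if_addr += if_bitmap[i]
        fl_addr += fl_bitmap[i]
    return psum, issued
```

A usage example: compressing an activation vector and a weight vector with `compress` and passing the four results to `sparse_dot` yields the same psum as the dense dot product, while the issued-MAC count equals the number of positions where both operands are non-zero.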

Figure 5: Microarchitecture of the FlexNN VPE demonstrating inter-connectivity among control units, sparsity acceleration logic, sparsity bitmap registers (SP), data register files (RF), and MAC units, alongside support for two accumulation orders.

Assisted by the PE FSM, internal registers within the CAG unit track the total number of PE blocks (or OF/psum points) produced, aiding in addressing IF/FL/OF RFs. Additionally, counters like ifcount, wcount, and ofcount manage the addresses/indexes for IF, FL, and OF RFs, increasing or clearing based on the number of input activations and weights required to calculate each OF point or psum block. The layer schedule determines the type and extent of IF/FL/OF RF data reuse, regulated by internal IF/FL/OF block counters controlling the loading of new IF/FL data and draining OF data each round, as per the layer’s optimal schedule. These internal structures and associated control logic are crucial to supporting flexible schedules within the VPE. The critical role of VPE in facilitating flexible blocking within the DNN accelerator is realized by dividing the RF into multiple subbanks (X) and incorporating X MACs (e.g., X = 4) alongside multiplexers, allowing the implementation of V×V, V×M, and M×M templates [24], based on the optimal blocking factor of the layer ($IC_B$, $OC_B$, $OX_B$, $OY_B$), as shown in Fig. 6.

Figure 6: Versatile Processing Element (VPE) accommodating (1) V×V and (2) M×M templates. In V×V, accumulation involves Input Channel (IC) of the same Output Channel (OC) of weights (FL), while M×M entails accumulation of ICs from different OCs. M×M involves different V×M each round. This illustration presents sparsity-aware flexible dataflow inside the VPE along with loop blocking and partitioning. Weight filter dimensions assumed: FX=1 and FY=1.

For specific schedules, the convolution operation can be partitioned to split across multiple VPEs based on the number of ICs. Consequently, DNN computations that generate psums across different sets of ICs for a particular OF point should be mapped to a single column or row of VPEs. External psum accumulation enables accumulation of all ICs partitioned into multiple VPEs to generate the final OF point. The FPA facilitates the transmission of psums formed within the PE to its right or top neighbor, which is essential for the psum accumulation in FPA. To mitigate wire congestion and routing complexity, interconnections between PEs are restricted to their top and right neighbors. As shown in Fig. 4.2, three multiplexers are used: the control signal accum_dir selects the neighbor, while accum_Nbr and en_ext_psum select between external accumulation and internal MAC accumulation. Note that these multiplexers are not shown in Figs. 5 and 6 for clarity. This architectural decision inherently influences how work is partitioned among different PEs, mainly in the IC dimension.

In certain optimal schedules, not all ICs are accumulated simultaneously. Instead, a portion of the IC set is initially loaded into the PE RFs, and the computed psum is extracted to the SRAM (or even DRAM) to be brought back into the PE RFs later when the remaining ICs are accumulated. External partial sum accumulation necessitates a 32-bit wide read and write direct bypass to and from the SRAMs. Sharing arithmetic units for MAC and Eltwise computation, along with multiplexer control logic routing appropriate tensor data into these units, reduces area overhead by enhancing hardware reuse efficiency within the PE. Residual networks such as ResNet require element-wise operations, such as the addition of OFs from two convolution layers. To support such operations while maximizing hardware resource reuse, OFs from two different layers are routed into the PE, using existing load and drain paths. The Eltwise field in the programmable descriptor signals an eltwise operation, bypassing the multiply operation within the PE and performing an eltwise addition of the two IF inputs.

Illustrative Example: The FlexNN PE demonstrates flexibility by executing V×V and M×M MAC operations, exploiting sparsity in both scenarios, as illustrated in Fig. 6(1) and 6(2), respectively. The PE adapts this flexibility in its operation based on the optimal schedule selected for each specific layer. Let us first delve into the V×V operation scenario, where ICs within the same OC are accumulated. In this example with $IC_B = 32$, 32 IFs and FLs are assigned to the PE for computation, corresponding to 32 distinct ICs but belonging to the same OC (represented by a single yellow color). Since these values exhibit sparsity, the sparsity bitmaps of IF and FL are stored in the respective registers IF SP BMP RF and FL SP BMP RF. The IF select signal retrieves the bitmaps from the first IF register ($IF_0$) and the first FL register ($FL_0$), transmitting them to the two-sided combined sparsity acceleration logic, which produces $CSB_{00}$. This logic identifies the non-zero activations and weight addresses through the CAG unit. These addresses guide the IF and FL CD RFs that store zero-compressed IF and FL values and provide precise values for MAC operations.
Simultaneously, this process repeats for the other IF registers ($IF_1$ to $IF_3$) and FL registers ($FL_1$ to $FL_3$), thus feeding the MACs with IF/FL points from different IF/FL RF subbanks and generating four psums concurrently, in each case, within the OF RFs. These psums aggregate to produce a single OF point upon completion of the computation. This cycle is iterated until all 32 ICs are processed. Subsequently, the next set of 32 ICs is loaded, and this process continues until all OF points for that OC are computed.

Now, let us dive into the second scenario of M×M operations, focusing on the computation of ICs across different OCs within the PE. Here, IF SP BMP RFs and IF CD RFs receive bitmaps and input features that correspond to four distinct OCs, each represented by a different color. Similarly, the corresponding FL SP BMP RFs are loaded with four different FLs. During computation, in the initial round ($i$), only elements from the first IF SP BMP RF are provided as input to all CAGs along with four different FL bitmap values for the four OCs. Consequently, the CAG generates addresses solely corresponding to $IF_0$ following a two-sided sparsity acceleration logic ($CSB_{00}$ to $CSB_{03}$). Subsequently, after obtaining the participating non-zero IFs and FLs from the compressed RFs, the MAC operation partially generates each of the four OC points in the first round, each denoted by a distinct color. Notably, in this case, MACs draw input exclusively from one IF RF subbank at a time, unlike the previous scenario in which they obtained IFs from all subbanks simultaneously. In the subsequent round, IF RF subbanks are switched using appropriate MUXing logic, consequently switching the compressed IF RF bank to acquire non-zero IF points for that specific round. Thus, contingent on the optimal layer schedule, the sparse PE efficiently conducts both V×V and M×M operations, capitalizing on both-sided sparsity in activations and weights.
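The difference between the two templates can be made concrete with a small dense (sparsity-free) model. The sketch below is only an illustration under stated assumptions: X = 4 subbanks and MACs, FX = FY = 1, and the hypothetical names `vxv`, `mxm`, `if_banks`, `fl_banks`, and `fl_banks_per_oc`, none of which come from the FlexNN design. In particular, the indexing `fl_banks_per_oc[oc][r]` (weights of output channel `oc` for the ICs held in IF subbank `r`) is an assumed data layout.

```python
X = 4  # number of RF subbanks and MACs, matching the example in the text

def vxv(if_banks, fl_banks):
    """V×V: MAC m reads IF subbank m and FL subbank m. The four psums
    are then added to form a single OF point for one output channel."""
    psums = [sum(a * w for a, w in zip(if_banks[m], fl_banks[m]))
             for m in range(X)]
    return sum(psums)

def mxm(if_banks, fl_banks_per_oc):
    """M×M: in round r all MACs read the same IF subbank r, while MAC `oc`
    uses the weights of output channel `oc`. Each round is one V×M step,
    and four OC points are built up in parallel across the rounds."""
    of_points = [0] * X
    for r in range(X):                      # IF subbank is switched each round
        for oc in range(X):                 # one MAC per output channel
            of_points[oc] += sum(
                a * w for a, w in zip(if_banks[r], fl_banks_per_oc[oc][r]))
    return of_points
```

Under this model, each entry of `mxm(...)` equals what `vxv` would return for the same output channel; only the order in which the subbanks are visited differs.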

Schedule-Aware Flexible Depth Adder Tree (FlexTree)

Our FlexNN accelerator’s core features a tree-based architecture named FlexTree, designed for psum accumulation across numerous PEs within the FPA to generate the final output point [25]. The distinguishing feature of FlexTree is its ability to dynamically adapt the depth of the adder tree, allowing the compiler to create flexible schedules for network layers of varying dimensions. This hardware enhancement allows the compiler/scheduler to discover highly compute-efficient schedules. Before delving into the FlexTree architecture, it is essential to understand the concept of Input Channel Partition ($IC_P$), similar to OF channel partition as illustrated earlier in Fig. 3. $IC_P$ denotes the number of partitions into which the ICs are divided, with each partition assigned to a single PE in the FPA. Consequently, it also denotes the number of PEs that participate in the partial sum accumulation. Let us elucidate $IC_P$ using an example of 64 ICs. When $IC_P = 1$, the computation involves only one PE, denoted PE1. All 64 ICs undergo pointwise multiplication and accumulation within PE1, producing the final output. When $IC_P = 2$, 64 ICs are evenly divided between PE1 and PE2, each processing 32 ICs. PE1 accumulates psums from channels 0 to 31, while PE2 accumulates those from channels 32 to 63, forming the final output collectively. Similarly, for $IC_P = 4$, the channels are distributed across PE1, PE2, PE3, and PE4, each PE handling 16 ICs. These psums of the respective sets of ICs are accumulated within each PE to generate the final output.
Essentially, $IC_P \times IC_B = IC$.
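The 64-channel example can be checked numerically. The following is a minimal sketch, assuming an even split of channels and using the hypothetical helper names `partition_ics` and `output_point`; it models only the arithmetic of the partitioning, not the hardware.

```python
def partition_ics(ic, ic_p):
    """Split `ic` input channels evenly across `ic_p` PEs.
    Returns one (start, stop) channel range per PE."""
    assert ic % ic_p == 0, "this sketch assumes an even split"
    ic_b = ic // ic_p                      # IC_P * IC_B = IC
    return [(p * ic_b, (p + 1) * ic_b) for p in range(ic_p)]

def output_point(acts, weights, ic_p):
    """Each PE accumulates a psum over its own channel range; the psums
    of all participating PEs are then added to form the final output."""
    psums = [sum(acts[c] * weights[c] for c in range(lo, hi))
             for lo, hi in partition_ics(len(acts), ic_p)]
    return sum(psums)
```

For 64 channels, `partition_ics(64, 2)` yields the ranges 0 to 31 and 32 to 63 (as half-open intervals), and `output_point` returns the same value for every valid partition factor, since partitioning changes only where the accumulation happens.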

Fig. 7 illustrates the FlexTree architecture, which receives 16 inputs from the 16 PEs within a column of the PE array in the DNN accelerator. The $IC_P$ supported by the adder tree network ranges from 1 to 16, inclusive. Even if $IC_P = 2$, the output of the computation must still pass through the adder tree network before producing the final OF output. This ensures a reduction in hardware overhead by simplifying hardware design and achieving uniformity across all $IC_P$ values. It is noteworthy that our FlexTree architecture can accommodate $IC_P$ values that are not powers of 2 by feeding zeros into the FlexTree network for PEs that do not align with powers of 2. Each module marked with a ‘+’ sign comprises both the INT8 adder and the FP16 adder to support convolution layers of different precision [26]. Depending on the input precision (INT8 vs. FP16), the psum output from the PEs is routed to the appropriate hardware resource within FlexTree. In Fig. 7, for $IC_P$ values of [1, 2], the flops [A, B, C, D, E, F, G, H] at level 1 serve as the final OF output tap points. For an $IC_P$ value of [4], the flops [I, J, K, L] at level 2 act as the final OF output tap points. Similarly, for $IC_P$ values of [8] and [16], the flops [M, N] at level 3 and [O] at level 4, respectively, serve as the final OF output tap points.
Therefore, the total number of FlexTree output tap points varies with $IC_P$: for $IC_P$ values of [1, 2, 4, 8, 16], the total number of FlexTree output tap points is [8, 8, 4, 2, 1], respectively. To simplify the extraction of final OF points from the FlexTree module into the drain module, we allow a maximum of four OF points to be extracted from FlexTree in one round. The figure illustration assumes $IC = 64$, $IC_P = 16$, and therefore Port O is active.
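The relation between $IC_P$ and the tap level can be sketched as a pairwise reduction over the 16 column inputs. This is an illustrative model only: `tap_level` and `flextree` are hypothetical names, and the sketch assumes that the caller already places zeros on inputs that do not carry a valid psum (for example, the fourth input of each group when $IC_P = 3$).

```python
def tap_level(ic_p):
    """Tree level (1 to 4) whose flops hold the final OF points.
    IC_P values of 1 and 2 both tap level 1; non-powers of 2 round up."""
    assert 1 <= ic_p <= 16
    level = 1
    while (1 << level) < ic_p:
        level += 1
    return level

def flextree(psums, ic_p):
    """Reduce the 16 PE psums pairwise, level by level, and return the
    values held at the tap level selected by IC_P."""
    assert len(psums) == 16
    values = list(psums)
    for _ in range(tap_level(ic_p)):
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values
```

The length of the returned list reproduces the tap-point counts [8, 8, 4, 2, 1] for $IC_P$ values of [1, 2, 4, 8, 16]; the limit of four extracted OF points per round is not modeled here.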

Figure 7: FlexTree architecture details with illustration using 64 input channels and 16 input channel partition factor.

As is evident from the above discussion, FlexTree achieves dynamic reconfiguration of the depth of the adder tree. This configurable feature is aided by software-programmable configuration registers. Unlike existing DNN accelerators where partial sum accumulation occurs by moving psums among neighboring PEs, FlexTree’s innovative tree-based architecture significantly enhances partial sum accumulation efficiency (up to 2.14× speedup). In contrast to state-of-the-art DNN accelerators with fixed schedules and adder tree-based architectures, where the adder tree depth remains fixed at design time, our FlexTree technique offers dynamic reconfiguration capabilities, achieving speedups of up to 4×-16× across seven DNNs, namely, ResNet50, GoogleNet, InceptionV2, MobileNetV2, MobileNetV3, SqueezeNet1.1 and MobileNet_SSD. Thus, our proposed FlexTree architecture enhances compute efficiency by allowing superior psum accumulation techniques across a wide range of layers found in modern DNNs.

III-C Schedule-aware Tensor Distribution Network (SDN)

The Schedule-aware Tensor Distribution Network (SDN) serves as one of the fundamental architectural innovations of the proposed FlexNN accelerator, tasked with efficiently transferring input data between on-chip memory (SRAM) and the flexible PE array, and vice versa, adhering to the optimal layer schedule [27]. These data include configuration settings, activation and kernel data (IF & FL), sparsity encodings, as well as bias & scale factors essential for calculation within the PE array. Additionally, the SDN manages the transportation of computational results, including output activation (OF) and partial sums, from the PE array’s internal storage structure (RF) back to the SRAM, ensuring that the layout facilitates subsequent tensor layer acceleration. During the operational phases, the input side of the distribution network is termed the “load/fill” phase, while the output side is termed the “drain” phase. In fixed hardware accelerators, the pre-determined data layout in SRAM simplifies the load and drain phases, but compromises flexibility and optimization in operations. This rigidity restricts reuse potential, escalates memory accesses, and significantly increases overall energy and power consumption. Flexible hardware demands dynamic changes in the SRAM data layout, contingent on the type of reuse and optimal schedule (blocking and partitioning) for the layer.

III-C1 Load Path

The usual design consideration revolves around simplifying either the load or the drain phase, while the other phase manages the complexities associated with rearranging the data to adhere to the optimal schedule. When the SRAM data layout remains fixed, the loading process must handle the complexity associated with unpacking the fixed layout data and arranging them according to the predetermined order and sequence dictated by the optimal schedule. Furthermore, the loading process must be hierarchical: initially organizing the input in a manner consumable by a column of PEs based on the partitioning factor and then, within the column, determining which input byte corresponds to which PE based on the blocking factor through a series of multiplexers [28, 29]. For activation data, this involves retrieving data from memory in a predetermined order and distributing the IX, IY, and IC in the sequence and quantity specified by the reuse factor of the optimal schedule. Throughout the reuse process, one set of input remains resident, while the other circulates multiple times. Typically, the optimal schedule strives to maximize reuse, thereby reducing the frequency of fetching from SRAM. Ideally, a fully flexible DNN accelerator would allow partitioning in both the incoming activation and weights. However, this approach introduces a significant rise in MUXing complexity, along with its accompanying overheads, resulting in a convoluted routing process in the load, circular buffer, and PE FSM. To mitigate these challenges, we restrict weight partitioning to within a column and activation partitioning to across columns of the PE array. This strategy aims to streamline routing complexities and enhance operational efficiency within the accelerator architecture.

Figure 8: FlexNN accelerator load path. Compressed activations (IF) and weights (FL) along with sparsity bitmaps are fetched from SRAM and delivered to PE array via the NoC interfaces. Flexible schedule support is integrated in Load FSM, Circular Buffer FSM, and PE FSM. OF NoC (part of drain) not shown inside PE array for clarity.

Fig. 8 illustrates the crucial load path of FlexNN, spanning from SRAM to PEs. Central to this pathway is the load FSM, which interfaces with SRAM, the circular buffer FSM, sparse byte select modules, and PE columns. Activation of the load FSM is initiated by a start signal, indicating that all configuration register values have been appropriately set by the schedule descriptor based on the optimal schedule. Optionally, this start signal can also serve as the reset input for the load FSM, ensuring the removal of any outdated register values from previous layers. Once the FSM is active and space becomes available within the circular buffer, the FSM transmits fetch address and bank select signals to SRAM, and the IF and FL NoCs commence the transmission of IF and FL data to their respective circular buffers. Concurrently, each PE within the PE columns returns credits to the load FSM as space becomes available in any of the double-buffered IF/FL data RFs. Upon receipt, the FSM directs read requests to the circular buffer FSM. Within the circular buffer FSM, metadata is managed and data are dequeued in response to load FSM requests, signaling when the buffer is empty.

In cases where valid compressed IF/FL data and corresponding sparsity bitmaps are present in the buffers, the circular buffer FSM initiates bitmap transmission to the sparse byte select module and compressed IF/FL data transmission to the IF/FL multiplexer (mux) array. The load FSM calculates logical byte select signals based on current load counter values and configuration registers. These signals are utilized by the byte select modules, along with sparsity bitmaps, to determine physical byte select signals, which control the IF and FL mux arrays for data routing, accounting for potential compression. These mux arrays facilitate data routing between the circular buffers and the column buffers associated with each PE column. It is important to note that when communicating with the PE, control signals pass through the distribution network, utilizing a specific number of staging buffers to meet timing requirements. The Load FSM remains active, fetching data until the entire load volume is processed, at which point clock gating is initiated.

Another critical aspect of the distribution network is the interconnect or Network-on-Chip (NoC) used to link the PE array with the drain and load blocks in the design. The NoC must be able to unicast, multicast, or broadcast the input data to one or more PEs based on the order specified by the optimal schedule, as demonstrated in Fig. 9. This maximizes reuse and minimizes the number of accesses to SRAM, improving overall efficiency, as described in existing research [30]. As shown in Fig. 8, when valid compressed data are loaded into column buffers, IF and FL NoCs distribute data to the appropriate PEs.

Figure 9: Data distribution patterns through flexible NoC.

III-C2 Drain Path

One of the core innovations within the FlexNN architecture is FlexDrain, an efficient framework that systematically drains OF maps across various schedules and is specifically tailored for flexible schedule-based DNN accelerators. Focusing on MAC operations along the IC dimension, the fixed drain pattern ensures consistent extraction of OF points in an IC-major fashion, regardless of the current layer schedule. This design choice capitalizes on the understanding that sparse compression in DNN accelerators predominantly occurs along the IC dimension. Implementing this fixed draining methodology simplifies drain design, integrating schedule awareness into load logic with minimal overhead. This novel approach holds promise for advancing DNN accelerators, enhancing reuse, reducing memory traffic, and improving energy efficiency.

The FlexDrain datapath encompasses several agents distributed across the PE compute array subsystem. Fig. 10 provides a high-level depiction of these components and their respective functions. The components constituting the drain data path are: 1) Local Drain (LD): Instantiated on a per-column basis, the LD is responsible for extracting the output activation or psums from the PE, facilitating their transfer to the Super Column Drain Concatenator (SCDC). 2) SCDC: Implemented on a per-super column basis, the SCDC is tasked with concatenating data from the output column buffers of all 4 columns within a super column using psum-NoC. Subsequently, it transmits this concatenated data to the Global Drain (GD) via the super-column NoC. 3) GD: Serving as a central agent, the GD plays a pivotal role in rearranging SCDC data in a 1×1×Z manner. It then encodes/compresses these data and writes them to the SRAM.

Figure 10: FlexNN accelerator local drain path, showing (1) local drain path routing output activations/partial sums from each PE inside a PE column to the column buffer, with groups of 4 columns organized into a super column, and (2) the SCDC per super column routing data from local drain to global drain.

Local Drain: The Local Drain (LD) operates to extract output activation data from the PEs within a column, forwarding them to the Post Processing Modules (PPMs), and then directing the PPM outputs to the column’s output buffer, subsequently routed to the centralized GD. An overview of the local drain datapath is depicted in Fig. 10. The accompanying block diagram illustrates the various components of the Local Drain at a high level, each of which will be elaborated upon in subsequent sections. As previously outlined, each PE can generate up to 16 OF points per round for a given set of input data, with the exact count contingent upon the layer and input tensor parameters. Upon readiness, these OF points transition from the active OF RF to the shadow OF RF. Following this transfer, the PE signals to the Accumulate Finite State Machine (AccumFSM), a part of local drain FSM, that the associated Local Drain is primed to extract the OF points from the PEs.

Flow of Control: Based on the layers and tensor parameters configured in the registers, it is possible to determine whether AccumFSM needs to consume data for accumulation across PEs, particularly if ICs are distributed across different PEs. In such scenarios, the LD waits for the AccumFSM to complete processing these OF points before proceeding. When AccumFSM is active, the LD streamlines the flow, refraining from extracting OF points until the accumulation is complete. In contrast, when AccumFSM is not required prior to LD operations, the PEs trigger the extraction process to the LD. The extraction sites and the number of points in each PE with a valid OF point to be extracted are determined using configuration registers. This information guides the sequential extraction of OF points from the shadow OF RF. Upon completing the extraction from the shadow OF RF, the LD indicates to the PEs that the shadow OF RF is fully utilized and prepared for the next round of transfers from the active OF RF.

OF Select to PPM: The OF outputs accumulated from the FlexTree flexible adder architecture (Section III-B), multiplexed (MUXed) using a 15:4 MUX, are fed into each PPM. LD orchestrates the selection of inputs for the psum-MUX in a round-robin manner, facilitating the transfer of input data into the PPM. The functionality of the PPM, primarily used for activation functions, quantization, etc., is configured via its bank of configuration registers, which serve as the foundation for processing the input data and selecting biases/scales. The LD assumes the responsibility of steering the data path for the PPM, issuing input data alongside bias/scale values, and subsequently extracting the output data to feed into the output column buffer. Note that the PPM module can handle both integer and floating-point precision.

OF Rearranger: The PPM data output links to the column OF buffer entries through a 4:1 DeMUX, configured by LD FSM based on the drained OF point context. LD directs the PPM output to the appropriate buffer entry. In layers where certain PEs yield no OF points, LD ensures that 0 values populate the corresponding buffer entries, facilitating seamless data drainage by GD. For floating-point cases, data must be outputted in high-low pattern for seamless processing by both GD and Sparse Encoder.

Taking into account the area and performance specifications of the accelerator, it has been established that each column, comprising 16 PEs, should integrate 4 PPMs. This configuration includes 4 INT PPMs and 4 FP PPMs, activated when FPMACs are enabled. Each of these PPMs is exclusively allocated to serve 4 PEs, ensuring optimized resource utilization and efficient processing capabilities.

Super Column Drain Concatenator (SCDC): The Super Column Drain Concatenator (SCDC) plays a pivotal role in consolidating data from the output column buffers of all 4 columns within a super column and forwarding it to the centralized GD via the Super Column NoC (SC-NoC). Each output column buffer within a column is 4 bytes wide and 16 entries deep. Once all round-required entries are filled in individual column buffers, LD in each column transmits 16 bytes of data from its column buffer to the SCDC through the psum agent packet NoC (psum-NoC) per round. The SCDC combines these 4×16B values into 64B data. A 2-bit super-column ID (SCID) is appended to create a 514-bit data packet for GD, which is subsequently transferred to the GD via the SC-NoC. Thus, the SCDC serves as a crucial link between the LD and the GD. Note that when PPM is enabled, each generated output activation is 1 byte for INT8 precision or 2 bytes for FP16/BF16 precision. In both cases, data from the column buffer is dispatched to the SCDC in 16-byte chunks.
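The packet arithmetic above (4 × 16 B = 64 B = 512 bits, plus a 2-bit SCID, giving 514 bits) can be illustrated with a small packing routine. The text does not specify where the SCID sits within the packet or the byte order of the payload, so the layout below (SCID in the two most significant bits, little-endian payload) is purely an assumption, as are the names `scdc_pack` and `scdc_unpack`.

```python
PAYLOAD_BITS = 4 * 16 * 8   # four 16-byte column chunks = 512 bits

def scdc_pack(column_chunks, scid):
    """Concatenate four 16-byte column chunks and attach a 2-bit SCID.
    Assumed layout: SCID occupies the two bits above the 512-bit payload."""
    assert len(column_chunks) == 4 and all(len(c) == 16 for c in column_chunks)
    assert 0 <= scid < 4
    payload = b"".join(column_chunks)
    return (scid << PAYLOAD_BITS) | int.from_bytes(payload, "little")

def scdc_unpack(packet):
    """Inverse of scdc_pack: returns (list of four 16-byte chunks, scid)."""
    scid = packet >> PAYLOAD_BITS
    payload = (packet & ((1 << PAYLOAD_BITS) - 1)).to_bytes(64, "little")
    return [payload[i:i + 16] for i in range(0, 64, 16)], scid
```

Packing followed by unpacking returns the original chunks and SCID, and the packed value never needs more than 514 bits.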

Figure 11: FlexNN accelerator global drain path, showing (1) drain staging buffer, (2) global drain mux network, (3) drain buffer banks, (4) drain address generation unit, (5) sparse encoders, and (6) write combining buffers. Collectively, these units drain uncompressed data from the local drains of the PE array and write compressed data to SRAM.

Global Drain: The Global Drain (GD), as demonstrated in Fig. 11, serves as the central hub within the PE array subsystem, tasked with gathering output activation points from PEs across all columns. Its primary function involves rearranging these activations into a 1×1×Z format, where Z represents the OC dimension for the current layer (or the IC dimension for the subsequent layer). Subsequently, the GD encodes these activations and writes them to the SRAM for further processing or storage. The GD comprises the following components: 1) A 256B input buffer called the Drain Staging Buffer (DSB) where the output activations from all PEs are staged. 64B data from each Super Column are transmitted to the GD via the SC-NoC, and concatenated into the DSB, thereby generating 256B for processing. 2) Global Drain Mux (GDM) network consisting of 4 sets of multiplexer arrays that rearrange the staged output activations from DSB into the Drain Banks. 3) 64 Drain Banks (DB) organized as 4 groups of 16 entries, with each Drain Bank of size 16B, for a total of 1KB. These buffers serve as a pre-final staging area for output activations from PEs before encoding and writing to SRAM. Each 16B bank can hold 16 OCs for the next layer, controlled by GDM for writing and DAGU for reading. 4) Drain Address Generator Unit (DAGU), which computes the $(x, y, z)$ coordinates of the points in the DBs that will be written into the SRAM. The read signals drive a drain bank reader (not shown in figure) to get the data out of the DBs. 5) A bank of four Sparse Encoders (SE), which encode the data to be written to the SRAM. 6) Four write combining buffers that allow the writing of compressed data and the corresponding sparsity bitmaps into the SRAMs. Among the six components, the Global Drain Mux and Sparse Encoder are the most important elements of the GD. Detailed explanations of these components are provided below.

Global Drain Mux: The role of the GD Mux control logic is to manage the selection process across the various mux stages to drain data from the DSB. There are a total of four GDMs, each capable of independently accessing DSB entries but restricted to writing solely to its designated group of DBs. Let us dive into each stage of the GDM: Stage S1: Entry Select: This stage involves choosing 16B of the DSB data. Following the configuration registers, the GD Mux control logic adopts a row-wise selection approach from the DSB, utilizing a 16:1 entry-select mux capable of picking any of the 16 row entries. Stage S2: Bank Select: The 64 drain banks (DBs) are grouped into four sets, each containing 16 entries. The bank select function determines to which of the 16 entries the data will be written. Notably, data can be written to multiple entries, potentially all 16 entries in an extreme scenario. The bank select or bank enable ensures proper multicasting of data to the DBs. Stage S3: Right Rotator: After each GDM multicasts the selected DSB entry to one or more DBs, it assigns an appropriate right rotation value specific to each DB. This step is crucial to align and concatenate consecutive Zs within a single DB. Stage S4: Byte Enable: Finally, byte enable serves as the write byte enable, ensuring that the correct set of bytes from the selected DSB line is written to the DB.
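The four stages can be read as a single write operation on one group of 16 drain banks. The sketch below is a behavioral model under stated assumptions, not the actual control logic: the function name `gdm_write` and its per-bank `bank_enable`, `rotate`, and `byte_enable` arguments are hypothetical, and rotation is assumed to operate at byte granularity.

```python
def gdm_write(dsb, banks, entry, bank_enable, rotate, byte_enable):
    """Model of stages S1 to S4 for one GDM and its group of 16 drain banks.

    dsb:         16 entries of 16 bytes each (the staged activations)
    banks:       16 bytearrays of 16 bytes each (modified in place)
    entry:       index of the DSB row to drain
    bank_enable: 16 booleans, one per bank (multicast mask)
    rotate:      16 right-rotation amounts in bytes, one per bank
    byte_enable: 16 lists of 16 booleans, the write mask per bank
    """
    line = dsb[entry]                                    # S1: entry select
    for b in range(16):
        if not bank_enable[b]:                           # S2: bank select
            continue
        r = rotate[b] % 16
        rotated = line[-r:] + line[:-r] if r else line   # S3: right rotate
        for i in range(16):
            if byte_enable[b][i]:                        # S4: byte enable
                banks[b][i] = rotated[i]
```

For instance, writing the first four bytes of one DSB entry with no rotation, then the first four bytes of the next entry with a rotation of 4 and a byte enable covering positions 4 to 7, places the two fragments side by side in the same bank, which is the alignment-and-concatenation role the text attributes to the rotator.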

Figure 12: Sparse encoder [11] in global drain performing zero-valued compression with illustration.

Sparse Encoder: This is a pivotal element within the global drain, crucial for leveraging data sparsity to enhance the speed of inference processing in FlexNN. Fig. 12 provides a schematic representation of the SE block. Its primary function is to compress dense data streams by discarding zero values, thereby outputting a compressed data representation accompanied by a sparsity bitmap. The drain buffer acts as a staging area for the data before they are written to SRAM, providing the SE with input. Each DB bank is allocated to store data corresponding to a unique output context (OX, OY) point, with a 16B payload containing the OC data for that specific OX, OY coordinate pair. A single context, representing one data stream, may extend across multiple banks and up to 16 contexts can be processed concurrently within a group. The GDM ensures that banks flagged with valid bits, indicating that their data has yet to be processed by the SE, are protected from being overwritten. The SE itself operates on 16B granularity, compressing the data for each distinct context contained within the DB. The degree of sparsity dictates the number of input lines that the SE must handle before it can produce a compressed 16B output line for a particular context. Alongside the compressed data, the SE generates a unique sparsity bitmap for each context, which is then sent to SRAM through the OF-NoC. The address for writing the bitmap to SRAM is determined by DAGU. As elements within a context stream may be received over several cycles, the SE is designed to manage context switching efficiently by preserving the state of each context and retrieving it when necessary to continue processing.
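Zero-value compression with a bitmap can be sketched in a few lines. The code below is an illustrative software model, assuming one bitmap bit per input byte; the names `sparse_encode`, `sparse_decode`, and `ContextEncoder` are hypothetical. The class mimics, in simplified form, how per-context state lets the encoder accept several 16B input lines before a full compressed 16B output line is available.

```python
def sparse_encode(dense):
    """Drop zero bytes; return (bitmap, compressed bytes).
    bitmap[i] is 1 where dense[i] is non-zero, else 0."""
    bitmap = [1 if v != 0 else 0 for v in dense]
    compressed = bytes(v for v in dense if v != 0)
    return bitmap, compressed

def sparse_decode(bitmap, compressed):
    """Rebuild the dense byte stream from a bitmap and compressed bytes."""
    it = iter(compressed)
    return bytes(next(it) if bit else 0 for bit in bitmap)

class ContextEncoder:
    """Keeps the state of one context so encoding can resume after a
    context switch; emits compressed data in full 16-byte lines."""
    def __init__(self):
        self.pending = bytearray()   # compressed bytes not yet emitted
        self.bitmap = []             # sparsity bitmap accumulated so far

    def push(self, line16):
        bitmap, compressed = sparse_encode(line16)
        self.bitmap += bitmap
        self.pending += compressed
        out = []
        while len(self.pending) >= 16:
            out.append(bytes(self.pending[:16]))
            del self.pending[:16]
        return out
```

With 50% sparsity, two input lines are needed before `push` returns one compressed output line, matching the observation that the degree of sparsity dictates how many input lines are consumed per output line.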

III-D Two-sided Sparsity Acceleration

In the proposed FlexNN accelerator, our aim is to harness the sparsity in both FLs and IFs to enhance not only energy efficiency but also DNN inference throughput. Throughout the accelerator, the data remains compressed until reaching the PE. Operating within the compressed domain offers advantages, such as reducing on-chip bandwidth requirements and storage demands. However, handling compressed data, which often varies in length, poses challenges in terms of data manipulation, such as distributing data across PEs and implementing sliding window processing within the PE [31]. In this section, we will introduce an innovative two-sided sparsity acceleration logic capable of processing sparse data within the compressed domain to achieve higher throughput [32, 33, 34, 35]. This logic spans multiple units including VPE, load path, and drain path.

The core idea is that IFs or FLs with a value of 0 do not contribute to non-zero outcomes during MAC operations, allowing them to be skipped during both the compute and storage phases [36]. As explained in Section III-C1, SRAM stores the zero-value-compressed input activations (IF) and weights (FL), which are delivered to each column buffer in batches through the load path in SDN [37, 33]. The PE FSM then transmits the compressed IF and FL to their respective buffers (CD RF) in each PE. Along with these, the corresponding bitmaps are transferred to the IF and FL sparsity bitmap buffers, respectively.

As illustrated in Fig. 5, the two-sided sparsity acceleration module receives the bitmaps as input. The bitmaps consist of 1-bit values, either ‘0’ or ‘1’, instead of the 8-bit values present in the original IF and FL sets (considering an 8-bit quantized network). For every non-zero element in IF, the corresponding position in the activation bitmap holds a ‘1’; for every zero value in the incoming IF set, it holds a ‘0’. The FL sparsity bitmap is generated in an identical way. Through a series of combinational operations, the module then determines the combined sparsity bitmap, which marks the exact positions where both IF and FL are non-zero. The total number of 1s in this intermediate bitmap is the number of activation and weight pairs that must be computed in the MAC unit and that result in non-zero partial sums, which are accumulated over time. Through the CAG unit, these values are then fed from the IF and FL RF into the MAC unit, which generates the OF maps, as demonstrated in Fig. 13. Subsequently, these feature maps flow through the drain path, which incorporates a zero-value compression module (in SE) to compress the zero-valued elements. These compressed feature maps are then stored back in the SRAM (or DRAM) for further processing in subsequent layers of the DNN. By exploiting two-sided sparsity through this combined logic, FlexNN completes DNN inference in fewer cycles, which improves throughput and significantly decreases the energy consumption of the PE array.
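The combined-bitmap step can be illustrated with a short sketch. This is an assumption-laden software model, not the CAG design: bitmaps are held as Python integers (bit `pos` set means element `pos` is non-zero), and the index into each compressed buffer is taken to be the count of set bits below that position. The helper names `compress`, `popcount`, and `sparse_dot` are hypothetical.

```python
from typing import List, Tuple


def compress(dense: List[int]) -> Tuple[List[int], int]:
    """Return (non-zero values in order, bitmap as an int with bit i = dense[i] != 0)."""
    bitmap = 0
    packed = []
    for pos, value in enumerate(dense):
        if value != 0:
            bitmap |= 1 << pos
            packed.append(value)
    return packed, bitmap


def popcount(x: int) -> int:
    """Number of set bits in x."""
    return bin(x).count("1")


def sparse_dot(if_packed: List[int], if_bitmap: int,
               fl_packed: List[int], fl_bitmap: int,
               width: int) -> Tuple[int, int]:
    """Dot product over compressed operands; a MAC is issued only where both bits are 1."""
    both = if_bitmap & fl_bitmap  # combined sparsity bitmap
    acc = 0
    macs = 0
    for pos in range(width):
        bit = 1 << pos
        if both & bit:
            # Assumed addressing: position in the compressed buffer equals
            # the number of non-zero elements that precede this position.
            i = popcount(if_bitmap & (bit - 1))
            j = popcount(fl_bitmap & (bit - 1))
            acc += if_packed[i] * fl_packed[j]
            macs += 1
    return acc, macs
```

The number of MACs issued equals the number of 1s in the combined bitmap, so a dense engine would spend `width` cycles on work that this model finishes in `popcount(if_bitmap & fl_bitmap)` steps.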

Figure 13: Two-sided combined sparsity acceleration logic.

IV Experimental Setup

Figure 14: Workflow overview for FlexNN implementation. Please refer to text for full-form of abbreviations.
TABLE I: Comparison of FlexNN with state-of-the-art fixed schedule accelerator designs. Identical memory hierarchies and cost ratios are used for evaluation.
| | Eyeriss [5] | TPU [6] | FlexNN |
|---|---|---|---|
| Memory Hierarchy | 3-level (a) | 3-level | 3-level |
| Num of PEs | 168 | 256 | 256 |
| RF (in each PE) | 512 B | 32 B | 208 B |
| On-chip Buffer/SRAM (b) | 108 KB | 64 KB | 1.5 MB |
| DRAM | 1 GB | 28 MB | 1 GB |
| Energy cost ratio (PE:RF:SRAM:DRAM) | 1:1:6:200 | 1:0.06:6:200 | 1:0.125:6:200 |

(a) Eyeriss has additional inter-PE communication with RF:PE = 1:2 cost ratio.
(b) SRAM sub-bank size remains constant for all.

To evaluate the efficiency of the proposed accelerator, we implemented FlexNN in Chisel 3.0, with the generated RTL simulated in Synopsys VCS®. We chose Chisel because it facilitates the generation of parametrizable designs featuring multiple variations of the PE, allowing easy modification of RF size, number of MAC units, etc. As mentioned in Section III-A, the accelerator supports UINT8, INT8, FP16, and BF16 precision. Subsequently, the RTL undergoes synthesis in the Synopsys Design Compiler (DC), using one of the industry’s most advanced process technology nodes (based on 7 nm), to generate the Gate-Level Netlist (GLS) and the corresponding area for each accelerator component. To estimate the power consumption of the proposed FlexNN accelerator, we employed Synopsys Verdi to generate an activity file (Switching Activity Interchange Format: SAIF) from test benches. The accelerator netlist, coupled with the activity file, serves as input to Synopsys PrimeTimePX (PTPX), enabling power estimation at the gate level for both block and full-chip designs of the FlexNN accelerator. An overview of this workflow is illustrated in Fig. 14. The FlexNN architecture comprises a unified tile of 256 PEs organized in a 16×16 grid (16 columns, each with 16 individual PEs), featuring 8 MAC units within each PE, for a total of 2048 MACs. This tile encompasses 1.5 MB of SRAM equipped with 32-byte read/write ports. Each PE consists of a 4×16 B IF data register file (RF), a 4×16 B FL data RF, and a 16×4 B OF RF. In addition, each PE contains a 4×2 B IF sparsity bitmap RF and a 4×2 B FL sparsity bitmap RF, each 1/8th the size of the corresponding data RF because 1 bit in the bitmap represents 1 byte of data. Together, these RFs amount to 208 B of RF per PE. The precision of the IF, FL, and OF points is 8-bit integer. The memory hierarchy of our design is given in Table I.
Operating at a frequency of 1.8 GHz and 0.75 V, the accelerator reaches a dense peak performance of 7.37 Tera Operations Per Second (TOPS), with efficiency metrics of 5 TOPS/W and 4.6 TOPS/mm².
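The RF budget and the peak throughput quoted above follow from the stated configuration. The short check below reproduces both numbers; it assumes the common convention that one MAC counts as two operations (a multiply and an add), which the text does not state explicitly, and all variable names are illustrative.

```python
# Register file sizes per PE, in bytes, as listed in the text.
IF_DATA_RF = 4 * 16
FL_DATA_RF = 4 * 16
OF_RF = 16 * 4
IF_BITMAP_RF = 4 * 2
FL_BITMAP_RF = 4 * 2

rf_per_pe = IF_DATA_RF + FL_DATA_RF + OF_RF + IF_BITMAP_RF + FL_BITMAP_RF

# One bitmap bit covers one data byte, so each bitmap RF is 1/8 of its data RF.
bitmap_ratio_ok = (IF_BITMAP_RF * 8 == IF_DATA_RF) and (FL_BITMAP_RF * 8 == FL_DATA_RF)

# Compute array: 16 x 16 PEs with 8 MAC units each.
num_pes = 16 * 16
macs_per_pe = 8
total_macs = num_pes * macs_per_pe

# Assumption: 1 MAC = 2 operations (multiply + accumulate).
OPS_PER_MAC = 2
freq_hz = 1.8e9
peak_tops = total_macs * OPS_PER_MAC * freq_hz / 1e12

print(f"RF per PE: {rf_per_pe} B")
print(f"Total MACs: {total_macs}")
print(f"Dense peak: {peak_tops:.2f} TOPS")
```

Under that assumption, 2048 MACs at 1.8 GHz give 7.37 TOPS, which matches the reported dense peak.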

We compared the performance of FlexNN against two state-of-the-art dense accelerators, namely Eyeriss [9] and TPU [13]. The comparison considered the design specifications described in Table I. Furthermore, we evaluated the performance of FlexNN on sparse DNN workloads using state-of-the-art networks [38]: ResNet50, MobileNetV2, InceptionV3, and GoogLeNet trained on the ImageNet dataset [39]. The first three models were compressed using (i) Quantization-Aware Training (QAT) to quantize weights/activations to INT8 precision and (ii) unstructured pruning using the regularization-based sparsity algorithm (RB-sparsity). GoogLeNet was quantized in the same way, but filter pruning with the geometric median criterion was applied. The compressed models were obtained from Intel’s Neural Network Compression Framework (NNCF) [40]. Per-layer and overall network weight sparsity were obtained from these models. Furthermore, all models were run on the entire ImageNet2012 validation dataset (50,000 images), and the activation sparsity at the input and output of each layer was calculated using PyTorch hooks. The average activation sparsity across the entire dataset, the weight sparsity, and the layer statistics were fed into a FlexNN evaluation framework, which was used to obtain the layer-wise and overall network compute acceleration and total energy consumption of the accelerator, reported in Section V. Specifically, we compared the performance of FlexNN, which uses two-sided combined sparsity, against dense accelerators without any sparsity support and those capable of exploiting fixed weight-sided sparsity [9, 41]. The framework was modified to evaluate the latency and energy of a dense variant and a weight-sided variant of FlexNN to allow a fair comparison.

V Experimental Results

In this section, we begin by providing a breakdown of the power and area consumption for our proposed FlexNN accelerator. Subsequently, we proceed to evaluate its performance using state-of-the-art network and dataset configurations.

V-A FlexNN Power and Area Results

We assess the power and area cost of the proposed FlexNN accelerator using an illustrative implementation in this paper. Fig. 15.1 and Fig. 15.2 show the power and area breakdowns, respectively, for the entire accelerator as well as at the PE level. As shown, the PE array unit consumes ≈83% of the power and ≈86% of the total area of the overall accelerator design. Furthermore, the MAC operation constitutes about 46% of the power and 54% of the area inside each PE of the FlexNN accelerator. This shows a reasonable power and area impact compared to the significant benefits it provides in terms of the ability to support flexible schedules.

V-B Comparison with SOTA Fixed Schedule Accelerators

Fig. 16 shows the improvement in energy efficiency of our flexible-schedule DNN accelerator FlexNN over two prominent fixed-schedule designs, Eyeriss [5] and TPU [6], assuming identical memory hierarchies. These results are obtained for two DNNs used in image classification and object detection, ResNet101 and YOLOv2, using our custom DNN accelerator energy estimation framework. We used dense models (i.e., with zero weight sparsity) for these results. Note that we scaled the memory hierarchy of the two accelerators to the same level as FlexNN for a fair comparison. In this figure, the y-axis represents the % reduction in energy consumption of FlexNN compared to these two designs. The left subplot depicts the layer-wise energy reduction for all layers, sorted in increasing order of reduction. In the right subplot, we summarize the distribution of reduction across all layers; the x-axis shows the two comparative accelerators. Compared to Eyeriss, FlexNN achieves a 40%–77% reduction for ResNet101 and 45%–77% for YOLOv2. Compared to TPU, FlexNN provides up to 62% and 58% energy savings for ResNet101 and YOLOv2, respectively. In certain layers (20 layers in ResNet101 and 4 layers in YOLOv2), FlexNN exhibits a slight energy increase (indicated by a negative energy reduction) compared to TPU, primarily because the TPU dataflow is already optimized for those specific layers. The increased energy consumption in these layers comes from the hardware support for flexibility, which inherently introduces slightly higher overhead. On average, however, FlexNN still offers energy savings of 14% and 22% over TPU for ResNet101 and YOLOv2, respectively.
On the other hand, we see an average improvement of 57% and 69% over Eyeriss. Despite higher energy consumption in select layers, FlexNN outperforms both fixed-schedule accelerators on average for both networks.
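To make the "% energy reduction" metric concrete, the sketch below weights access counts by the normalized PE:RF:SRAM:DRAM cost ratio from the FlexNN column of Table I and compares two schedules for one layer. It is only an illustration of the metric: the access counts are invented for the example, the function names are hypothetical, and the authors' energy estimation framework is not described in enough detail here to claim this matches it.

```python
from typing import Dict

# Normalized energy per access (PE : RF : SRAM : DRAM = 1 : 0.125 : 6 : 200),
# taken from the FlexNN column of Table I.
COST = {"mac": 1.0, "rf": 0.125, "sram": 6.0, "dram": 200.0}


def layer_energy(counts: Dict[str, int]) -> float:
    """Normalized energy of one layer: sum of access counts times cost per access."""
    return sum(COST[level] * n for level, n in counts.items())


def percent_reduction(e_fixed: float, e_flex: float) -> float:
    """Positive when the flexible schedule saves energy, negative when it costs more."""
    return 100.0 * (e_fixed - e_flex) / e_fixed


# Hypothetical access counts for a single layer (illustrative numbers only).
fixed_schedule = {"mac": 1_000_000, "rf": 3_000_000, "sram": 200_000, "dram": 10_000}
flex_schedule = {"mac": 1_000_000, "rf": 3_200_000, "sram": 80_000, "dram": 10_000}

reduction = percent_reduction(layer_energy(fixed_schedule), layer_energy(flex_schedule))
print(f"Energy reduction for this layer: {reduction:.1f}%")
```

Because SRAM and DRAM accesses are far more expensive than RF accesses, a schedule that trades a few extra RF accesses for fewer SRAM reads lowers total energy. A layer where the fixed dataflow already minimizes the expensive accesses would instead show a negative reduction.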

Figure 15: (1) Power and (2) Area Breakdown of FlexNN for overall accelerator and PE level granularity.

V-C Sparsity Benefits using FlexNN

In this section, we analyze the layer-wise and overall network speed-up achieved by FlexNN compared to two counterparts: a dense accelerator without any sparsity acceleration support and a fixed weight-sided sparse accelerator. Fig. 17.1-4 presents the layer-wise compute acceleration (y-axis) provided by the weight-sided accelerator and FlexNN relative to the dense accelerator, for a few representative layers (x-axis) of 4 DNN benchmarks. Note that the activation sparsity numbers reported in the following discussion are averaged across the entire dataset. For a fair comparison, benchmarking was performed using the same optimal schedule for all accelerator types.

V-C1 ResNet50

The sparse ResNet50 model has 5%–88% unstructured weight sparsity, $weight\_sp_{layer}$, resulting in up to 8.1× acceleration across layers in the weight-sided accelerator. However, except before the first conv layer, ResNet50 has a high activation sparsity, $act\_sp_{layer}$, at the input of every convolution layer due to the presence of the ReLU activation function. On average across the entire ImageNet validation dataset, this amounts to $act\_sp_{layer}$ = 14%–83% sparsity. FlexNN leverages both weight and activation sparsity to provide up to 10.3× compute acceleration, as shown in Fig. 17.1. Overall, FlexNN gives up to 3.1× better acceleration than the weight-sided accelerator for ResNet50.

V-C2 GoogLeNet

Since GoogLeNet was filter-pruned, the maximum $weight\_sp_{layer}$ = 30%. This contributed to a maximum 1.4× speed-up in the weight-sided accelerator. In contrast, the maximum measured $act\_sp_{layer}$ = 91% resulted in a maximum acceleration of 10.8× in FlexNN. Fig. 17.3 shows that FlexNN provides up to 7.7× better compute acceleration compared to the fixed weight-sided accelerator, even for networks with low weight sparsity.

V-C3 InceptionV3

This model is very sparse, with a maximum $weight\_sp_{layer}$ = 96%. There are many layers with large dimensions and filter sizes; therefore, both the weight-sided accelerator and FlexNN can leverage weight sparsity and provide up to 24.7× speed-up for Mixed.7a.branch3x3.2.conv, Layer Id: 72 (not shown in figure). Although $act\_sp_{layer}$ for this layer is 78%, FlexNN cannot provide any additional speed-up for this layer. However, there are many other layers with activation sparsity higher than weight sparsity, allowing FlexNN to leverage both. As depicted in Fig. 17.4, FlexNN provides a high level of compute acceleration. Among such layers with sparsity skewed toward activations, the maximum speed-up is 11.3×. Therefore, the proposed design can take the best of both worlds and give better savings than the weight-sided accelerator. Across all layers, FlexNN is up to 4.3× faster than the weight-sided accelerator.

V-C4 MobileNetV2

MobileNetV2 is a compact and lightweight model compared to the other benchmarks discussed earlier. Although sparse MobileNetV2 has up to 70% $weight\_sp_{layer}$, the maximum speed-up provided by the weight-sided accelerator is only 3.3× (last linear layer). Interestingly, the weight sparsity of all conv layers except features.18.0 (Layer Id: 51) is <50%, leading to a low overall speed-up. However, FlexNN, leveraging activation sparsity (maximum $act\_sp_{layer}$ = 74%) in addition to weight sparsity, can provide up to 4.1× acceleration. Fig. 17.2 indicates that even for compact models with small layer sizes, FlexNN is superior to the weight-sided accelerator by 3.9×.

Figure 16: Layer-wise distribution of % energy improvement of FlexNN over fixed schedule DNN accelerators (Table I) for ResNet101 and YOLOv2 (dense models). Optimal schedule used for each layer in FlexNN accelerator.
Figure 17: Comparison of layerwise compute acceleration of FlexNN and Weight-sided (one-sided) sparse accelerator over dense accelerator (without any sparsity support) benchmarked with (1) ResNet50, (2) MobileNetV2, (3) GoogLeNet, (4) InceptionV3. Only a few representative layers are presented for each network.
Figure 18: Comparison of compute acceleration for full network inference in FlexNN over dense and weight-sided accelerator benchmarked with 4 DNNs.

V-C5 Overall Network Acceleration

The compute acceleration obtained by the dense, weight-sided, and proposed accelerators for end-to-end network inference, depicted in Fig. 18, reveals a significant acceleration advantage for FlexNN across all evaluated networks. Here, the y-axis represents the acceleration, whereas the x-axis represents the benchmarks. The dense accelerator does not provide any acceleration, as it cannot leverage weight or activation sparsity; this is denoted by a value of 1. The speed-up for the weight-sided accelerator is proportional to the overall network weight sparsity. Across all these networks, the weight-sided accelerator provides 1.01×–1.79× speed-up. In contrast, the acceleration obtained in FlexNN depends on the relative distribution of weight and activation sparsity. For ResNet50, $(weight\_sp_{network}, act\_sp_{network})$ = (61%, 55%), and FlexNN takes advantage of both to provide 3.11× speed-up.
MobileNetV2 and GoogLeNet have $(weight\_sp_{network}, act\_sp_{network})$ = (52%, 30%) and (24%, 58%), respectively. These result in 1.81× and 2.63× speed-up in FlexNN, respectively. Clearly, even for these two networks with low sparsity on one side, FlexNN provides significant compute acceleration due to its two-sided sparsity support. Finally, InceptionV3 has $(weight\_sp_{network}, act\_sp_{network})$ = (61%, 63%), contributing to 3.3× speed-up, which is the maximum across the 4 networks.
As evident from these results, our accelerator outperforms both dense and weight-sided architectures in terms of compute acceleration on every benchmark. This improvement, 2.6× vs. the dense and 1.8× vs. the weight-sided accelerator (geomean), underscores the efficacy of the proposed approach in accelerating DNN inference. Furthermore, the observed acceleration benefits hold across the various architectural complexities and model sizes represented by the DNNs considered in our evaluation, indicating that two-sided sparsity acceleration in FlexNN is effective across a spectrum of DL model architectures.
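The geomean figure against the dense baseline can be reproduced from the four per-network speed-ups quoted above. The snippet is a simple check with illustrative variable names; the per-network speed-ups of the weight-sided accelerator are not listed individually in the text, so only the comparison against the dense baseline is recomputed here.

```python
import math

# FlexNN end-to-end speed-up over the dense accelerator, as quoted in the text.
flexnn_speedup = {
    "ResNet50": 3.11,
    "MobileNetV2": 1.81,
    "GoogLeNet": 2.63,
    "InceptionV3": 3.3,
}


def geomean(values):
    """Geometric mean: exp of the average of the logs."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


overall = geomean(flexnn_speedup.values())
print(f"Geomean speed-up vs. dense: {overall:.2f}x")
```

The geometric mean is the usual choice for averaging ratios such as speed-ups, since it is not dominated by the single largest value the way an arithmetic mean would be.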

V-C6 Energy Efficiency improvement

Figure 19 presents the improvement in energy efficiency of 3 different accelerator architectures (y-axis) while evaluating 4 DNN benchmarks (x-axis) on the ImageNet validation dataset. We considered dense accelerator energy consumption as the baseline. As evident from the figure, these results largely correlate with the overall network compute acceleration in Fig. 18, since the accelerator circuits are active for a reduced amount of time. Furthermore, compared to the weight-sided accelerator, FlexNN allows for a substantial reduction in memory cycle count as ZVC-compressed data flows through the different memory hierarchies, resulting in reduced memory energy consumption. This is enabled by the sparsity-aware load and drain path, as explained in Section III-C. Note that DRAM transactions are not considered in these results. Across all 4 benchmarks, FlexNN is 2.4× and 1.7× more energy efficient than the dense and weight-sided accelerators, respectively.

In conclusion, our comprehensive evaluation showcases not only the substantial compute acceleration achieved by our proposed accelerator, but also its remarkable energy efficiency improvements compared to existing dense and weight-sided architectures. This underscores the pivotal role of our approach in addressing the dual challenges of performance enhancement and energy conservation in DNN accelerators, paving the way for sustainable and efficient AI hardware solutions.

Figure 19: Comparison of energy efficiency for full network inference in FlexNN over dense and weight-sided accelerator benchmarked with 4 DNNs.

VI Related Work

Recent years have witnessed a surge in the field of DNN accelerators. Most DNN accelerator designs implement only fixed schedules with a fixed dataflow. Fig. 20 illustrates the different DNN accelerators from industry and academia and their supported datatypes. For example, NeuFlow [42] and ISAAC [43] implement a weight stationary schedule, ShiDianNao [44] and Movidius VPU2 [45] implement an output stationary schedule, Google TPU [6] implements only a Nonlocal Reuse schedule, and Eyeriss [5] from MIT implements a row stationary schedule. A key challenge arises from the limitations of the tensor data PE module hardware, which operates solely on a fixed dataflow pattern. Because it has no awareness of schedule information, it cannot dynamically adjust to accommodate diverse schedules. Therefore, one cannot implement different schedules (i.e., dataflows) in these accelerators, and to date no existing accelerator supports flexible schedules. In addition to hardware solutions, software-based solutions on general-purpose CPUs and GPUs can mimic programmable PE array units that compute on tensor data with varying dataflows, but fixed-function accelerators do not support this flexibility by design. Therefore, these software solutions cannot be used in existing accelerators. Moreover, software solutions are far from energy optimal and are thus unsuitable for adoption in edge inference devices. FPGAs provide an alternative avenue for flexible DNN acceleration, but the hardware configuration of the FPGA cannot be changed during the execution of one DNN application, which also implies a fixed schedule during execution. Additionally, FPGAs have lower energy efficiency compared to ASIC hardware accelerators.

Figure 20: Edge DNN accelerator competitive landscape, plotted with public data, considering baseline reference [14].

The PE array modules in all previous designs have limited functionality in the form of a basic MAC structure, which is the compute kernel for the convolution operation. Because the data patterns required by different input, weight, and output stationary optimal schedules differ widely, a fixed-architecture PE array cannot handle the tensor data correctly without sacrificing energy and/or performance. The key disadvantage is that the PE array that performs the convolution computation in these previous solutions is not schedule aware. Due to this limitation, the energy-optimal dataflow must be reformatted into a dataflow supported by the underlying fixed-architecture PE array. This induces a severe performance and energy penalty, as more SRAM reads are required to complete the work, and it prevents the PE array from reaching maximum utilization if the accesses are serialized. Software solutions can also rearrange the input activation, output activation, and weight tensor data of different optimal schedules into the fixed dataflow supported by the PE array. However, this not only requires assisting CPUs but is also highly inefficient in energy and performance, significantly diminishing the energy efficiency gains offered by flexible scheduling.

In contrast to these approaches, we propose a schedule-aware runtime configurable PE array module, which can (1) process tensor data (both weights and activations) that are either input, output, or weight stationary, or even a mixture of these dataflow types, depending upon the energy optimal schedule for the current DNN layer, (2) have its microarchitecture be reconfigurable at runtime based on software-programmable configuration registers, and (3) leverage maximum activation and weight reuse by having small amount of distributed local storage close to compute within the PE array itself. The dataflow support in the proposed PE array module is flexible. It is controlled by a list of configuration descriptors, which are set at the beginning of the execution of each layer. This tensor data computation PE array module is a pure hardware solution that exposes hardware knobs to the compiler and configures the dataflow during runtime, enabling the flexible schedules of convolutional layers in DNN accelerators without performance penalty due to rearranging computation within the PE or having to offload any work to CPU or software.

VII Conclusion

In this paper, we proposed a flexible schedule-aware DNN accelerator FlexNN, which can adapt its internal dataflow to the optimal schedule of each layer in DNNs. Our proposed solution maximizes data reuse at each memory level, resulting in significant energy savings arising from optimal data reuse. Note that flexibility works seamlessly on top of existing performance-enhancing features such as sparsity acceleration and low-precision logic, and it does not diminish their impact in any manner. It is evident that this flexibility comes at the cost of additional area overhead compared to fixed dataflow accelerators, but it also enables us to achieve significant energy savings on average across a myriad of DNN layers. Furthermore, we propose a novel approach to improve throughput and reduce energy usage in the FlexNN architecture. Taking advantage of fine-grained sparsity in both activation and weight tensors, we optimize the inference engine within the hardware accelerator. Experimental results demonstrate significant improvements in both performance and energy efficiency compared to existing DNN accelerators. This research contributes to ongoing efforts to develop more efficient hardware accelerators for executing deep neural networks.

VIII Acknowledgements

We would like to sincerely thank Gautham Chinya, Debebrata Mohapatra, Huichu Liu, Moongon Jung, Sang Kyun Kim, Guruguhanathan Venkataramanan, Raymond Sung, Hong Wang, and Cormac Brick for their contributions to this work.

References

  • [1] A. Raha, R. Sung, S. Ghosh, P. K. Gupta, D. A. Mathaikutty, U. I. Cheema, K. Hyland, C. Brick, and V. Raghunathan, “Efficient hardware acceleration of emerging neural networks for embedded machine learning: An industry perspective,” in Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing: Hardware Architectures.   Springer, 2023, pp. 121–172.
  • [2] A. Raha, S. K. Kim, D. A. Mathaikutty, G. Venkataramanan, D. Mohapatra, R. Sung, C. Brick, and G. N. Chinya, “Design considerations for edge neural network accelerators: An industry perspective,” in 2021 34th International Conference on VLSI Design and 2021 20th International Conference on Embedded Systems (VLSID).   IEEE, 2021, pp. 328–333.
  • [3] S. K. Ghosh, A. Raha, and V. Raghunathan, “Energy-efficient approximate edge inference systems,” ACM Transactions on Embedded Computing Systems, vol. 22, no. 4, pp. 1–50, 2023.
  • [4] A. Raha, S. Ghosh, D. Mohapatra, D. A. Mathaikutty, R. Sung, C. Brick, and V. Raghunathan, “Special session: Approximate tinyml systems: Full system approximations for extreme energy-efficiency in intelligent edge devices,” in 2021 IEEE 39th International Conference on Computer Design (ICCD).   IEEE, 2021, pp. 13–16.
  • [5] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16, 2016.
  • [6] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” 2017.
  • [7] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, “Understanding reuse, performance, and hardware cost of dnn dataflow: A data-centric approach,” in Proc. MICRO, ser. MICRO ’52.   New York, NY, USA: Association for Computing Machinery, 2019, pp. 754–768.
  • [8] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” ACM SIGARCH computer architecture news, vol. 44, no. 3, pp. 367–379, 2016.
  • [9] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE journal of solid-state circuits, vol. 52, no. 1, pp. 127–138, 2016.
  • [10] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” ACM SIGARCH computer architecture news, vol. 45, no. 2, pp. 27–40, 2017.
  • [11] S. K. Ghosh, S. Kundu, A. Raha, D. A. Mathaikutty, and V. Raghunathan, “Harvest: Towards efficient sparse dnn accelerators using programmable thresholds,” in 2024 37th International Conference on VLSI Design and 2021 20th International Conference on Embedded Systems (VLSID).   IEEE, 2024.
  • [12] Chen et al., “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
  • [13] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th annual international symposium on computer architecture, 2017, pp. 1–12.
  • [14] M. Horowitz, “1.1 Computing’s energy problem (and what we can do about it),” in Proc. ISSCC, 2014.
  • [15] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, and A. Peste, “Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks,” JMLR, vol. 22, no. 1, pp. 10 882–11 005, 2021.
  • [16] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network computing,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 1–13, 2016.
  • [17] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-x: An accelerator for sparse neural networks,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).   IEEE, 2016, pp. 1–12.
  • [18] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 243–254, 2016.
  • [19] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. Ha, P. Raina et al., “Interstellar: Using halide’s scheduling language to analyze dnn accelerators,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 369–383.
  • [20] H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar, “Maestro: A data-centric approach to understand reuse, performance, and hardware cost of dnn mappings,” IEEE micro, vol. 40, no. 3, pp. 20–29, 2020.
  • [21] D. Mohapatra, A. Raha, G. Chinya, H. Liu, C. Brick, and L. Hacking, “Configurable processor element arrays for implementing convolutional neural networks,” Apr. 30 2020, US Patent App. 16/726,709.
  • [22] S. Hsu, A. Agarwal, D. Mohapatra, A. Raha, M. Jung, G. Chinya, and R. Krishnamurthy, “Multi-buffered register files with shared access circuits,” Apr. 22 2021, US Patent App. 17/132,895.
  • [23] A. Raha, D. Mohapatra, G. Chinya, G. Venkataramanan, S. K. Kim, D. Mathaikutty, R. Sung, and C. Brick, “Performance scaling for dataflow deep neural network hardware accelerators,” Sep. 2 2021, US Patent App. 17/246,341.
  • [24] D. Mohapatra, A. Raha, D. A. Mathaikutty, R. J.-H. Sung, and C. M. Brick, “Runtime configurable register files for artificial intelligence workloads,” Mar. 10 2022, US Patent App. 17/530,156.
  • [25] D. Mohapatra, A. Raha, D. Mathaikutty, R. Sung, and C. Brick, “Schedule-aware dynamically reconfigurable adder tree architecture for partial sum accumulation in machine learning accelerators,” Apr. 28 2022, US Patent App. 17/520,281.
  • [26] A. Raha, M. A. Anders, R. J.-H. Sung, D. Mohapatra, D. A. Mathaikutty, R. K. Krishnamurthy, and H. Kaul, “Floating point multiply-accumulate unit for deep learning,” Jun. 16 2022, US Patent App. 17/688,131.
  • [27] G. Chinya, H. Liu, A. Raha, D. Mohapatra, C. Brick, and L. Hacking, “Schedule-aware tensor distribution module,” Feb. 20 2024, US Patent 11,907,827.
  • [28] D. Mathaikutty, A. Raha, R. Sung, D. Mohapatra, and C. Brick, “Sparsity-aware datastore for inference processing in deep neural network architectures,” Mar. 3 2022, US Patent App. 17/524,333.
  • [29] D. A. Mathaikutty, A. Raha, R. J.-H. Sung, and D. Mohapatra, “Data reuse in deep learning,” Jun. 16 2022, US Patent App. 17/684,764.
  • [30] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects,” ACM SIGPLAN Notices, vol. 53, no. 2, pp. 461–475, 2018.
  • [31] A. Raha, D. Mohapatra, D. A. Mathaikutty, R. J.-H. Sung, and C. M. Brick, “System and method for balancing sparsity in weights for accelerating deep neural networks,” Mar. 17 2022, US Patent App. 17/534,976.
  • [32] G. Chinya, D. Mathaikutty, G. Venkataramanan, D. Mohapatra, M. Jung, S. K. Kim, A. Raha, and C. Brick, “Accelerated loading of unstructured sparse data in machine learning architectures,” Feb. 11 2021, US Patent App. 17/081,509.
  • [33] A. Raha, D. Mathaikutty, D. Mohapatra, S. K. Kim, G. Chinya, and C. Brick, “Methods and apparatus to load data within a machine learning accelerator,” Oct. 21 2021, US Patent App. 17/359,392.
  • [34] A. Raha, M. Langhammer, D. Mohapatra, N. Tunali, and M. Wu, “Methods and apparatus to perform low overhead sparsity acceleration logic for multi-precision dataflow in deep neural network accelerators,” Sep. 15 2022, US Patent App. 17/709,337.
  • [35] S. Kundu, A. Raha, D. A. Mathaikutty, and K. Basu, “Rash: Reliable deep learning acceleration using sparsity-based hardware,” in 2024 25th International Symposium on Quality Electronic Design (ISQED). IEEE, 2024.
  • [36] F. Connor, D. Bernard, and N. Hanrahan, “Dot product calculators and methods of operating the same,” U.S. Patent 10,768,895 B2, Sep. 8, 2020.
  • [37] G. Chinya, D. Mohapatra, A. Raha, H. Liu, and C. Brick, “Methods, systems, articles of manufacture, and apparatus to decode zero-value-compression data vectors,” Oct. 31 2023, US Patent 11,804,851.
  • [38] Meta AI, “The latest in machine learning — papers with code,” https://paperswithcode.com/.
  • [39] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [40] A. Kozlov, I. Lazarevich, V. Shamporov, N. Lyalyushkin, and Y. Gorbachev, “Neural network compression framework for fast model inference,” arXiv preprint arXiv:2002.08679, 2020.
  • [41] J. Park, H. Yoon, D. Ahn, J. Choi, and J.-J. Kim, “Optimus: Optimized matrix multiplication structure for transformer neural network accelerator,” Proceedings of Machine Learning and Systems, vol. 2, pp. 363–378, 2020.
  • [42] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “Neuflow: A runtime reconfigurable dataflow processor for vision,” in CVPR 2011 WORKSHOPS, 2011, pp. 109–116.
  • [43] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 14–26.
  • [44] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: Shifting vision processing closer to the sensor,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015, pp. 92–104.
  • [45] Intel, “Intel keembay,” https://newsroom.intel.com/wp-content/uploads/sites/11/2019/11/intel-ai-summit-keynote-slides.pdf.