-
Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference
Authors:
Donghyeon Joo,
Ramyad Hadidi,
Soheil Feizi,
Bahar Asgari
Abstract:
The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of gigabytes in size. Offloading is a popular method to escape this constraint by storing the weights of an LLM in host CPU memory and SSD, then loading each weight to the GPU before every use. In our case study of offloaded inference, we found that due to the low bandwidth between storage devices and GPU, the latency of transferring large model weights from their offloaded location to GPU memory becomes the critical bottleneck, with actual compute taking nearly 0% of runtime. To effectively reduce the weight transfer latency, we propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with high compression ratio and low decompression overhead. Endor achieves this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves 2.25x speedup on OPT-66B and 2.37x speedup on Llama2-70B.
Submitted 17 June, 2024;
originally announced June 2024.
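The bitmap idea in the Endor abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `compress` and `decompress`, the one-bit-per-element layout, and the least-significant-bit-first ordering are all assumptions made for illustration. It only shows how a bitmap plus a packed list of non-zero values can represent an unstructured sparse vector.

```python
def compress(weights):
    """Hypothetical bitmap encoding: one bit per element, plus the non-zeros in order."""
    bitmap = bytearray((len(weights) + 7) // 8)
    values = []
    for i, w in enumerate(weights):
        if w != 0:
            # Mark position i as non-zero (LSB-first within each byte; an assumption).
            bitmap[i // 8] |= 1 << (i % 8)
            values.append(w)
    return bytes(bitmap), values


def decompress(bitmap, values, n):
    """Rebuild the dense vector of length n from the bitmap and packed values."""
    out = [0.0] * n
    packed = iter(values)
    for i in range(n):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            out[i] = next(packed)
    return out


weights = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 3.0]
bitmap, values = compress(weights)
restored = decompress(bitmap, values, len(weights))
```

In this sketch the position metadata costs one bit per element regardless of where the non-zeros fall, which is why a bitmap suits unstructured sparsity.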
-
Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication
Authors:
Sanjali Yadav,
Bahar Asgari
Abstract:
Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning. These applications exploit the sparsity of matrices to reduce storage and computational demands. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Traditional hardware accelerators are tailored for specific sparsity patterns with fixed dataflow schemes (inner, outer, and row-wise) but often perform suboptimally when the actual sparsity deviates from these predetermined patterns. As the use of SpGEMM expands across various domains, each with distinct sparsity characteristics, the demand for hardware accelerators that can efficiently handle a range of sparsity patterns is increasing. This paper presents a machine-learning-based approach for adaptively selecting the most appropriate dataflow scheme for SpGEMM tasks with diverse sparsity patterns. By employing decision trees and deep reinforcement learning, we explore the potential of these techniques to surpass heuristic-based methods in identifying optimal dataflow schemes. We evaluate our models by comparing their performance with that of a heuristic, highlighting the strengths and weaknesses of each approach. Our findings suggest that using machine learning for dynamic dataflow selection in hardware accelerators can provide gains of up to 28x.
Submitted 14 June, 2024;
originally announced June 2024.
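The three dataflow schemes named in the Misam abstract (inner, outer, and row-wise) differ only in loop order. The sketch below is a software illustration, not the paper's accelerator or selection model; the function names are hypothetical, and matrices are plain lists of lists with zeros skipped so the access pattern of each scheme is visible.

```python
def spgemm_inner(A, B):
    """Inner product: each C[i][j] is a dot product of row i of A and column j of B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(m) if A[i][k] and B[k][j])
    return C


def spgemm_outer(A, B):
    """Outer product: column k of A times row k of B, accumulated into all of C."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for k in range(m):
        for i in range(n):
            if A[i][k]:
                for j in range(p):
                    if B[k][j]:
                        C[i][j] += A[i][k] * B[k][j]
    return C


def spgemm_rowwise(A, B):
    """Row-wise: row i of C is a sum of rows of B scaled by the non-zeros in row i of A."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            if A[i][k]:
                for j in range(p):
                    if B[k][j]:
                        C[i][j] += A[i][k] * B[k][j]
    return C


A = [[1, 0, 2], [0, 0, 3]]
B = [[0, 4], [5, 0], [6, 0]]
results = [f(A, B) for f in (spgemm_inner, spgemm_outer, spgemm_rowwise)]
```

All three produce the same product; what changes is which operand is reused and how partial results are accumulated, which is the property a dataflow selector would trade off against the sparsity pattern.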
-
Context-Aware Task Handling in Resource-Constrained Robots with Virtualization
Authors:
Ramyad Hadidi,
Nima Shoghi Ghalehshahi,
Bahar Asgari,
Hyesoon Kim
Abstract:
Intelligent mobile robots are critical in several scenarios. However, as their computational resources are limited, mobile robots struggle to handle several tasks concurrently while still guaranteeing real-timeliness. To address this challenge and improve the real-timeliness of critical tasks under resource constraints, we propose a fast context-aware task handling technique. To effectively handle tasks in real time, our proposed context-aware technique comprises three main ingredients: (i) a dynamic time-sharing mechanism, coupled with (ii) event-driven task scheduling using the reactive programming paradigm to mindfully use the limited resources; and (iii) a lightweight virtualized execution to easily integrate functionalities and their dependencies. We showcase our technique on a Raspberry-Pi-based robot with a variety of tasks such as simultaneous localization and mapping (SLAM), sign detection, and speech recognition with a 42% speedup in total execution time compared to the common Linux scheduler.
Submitted 9 April, 2021;
originally announced April 2021.
-
Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads
Authors:
Bahar Asgari,
Ramyad Hadidi,
Joshua Dierberger,
Charlotte Steinichen,
Amaan Marfatia,
Hyesoon Kim
Abstract:
Sparse matrices are the key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to significantly eliminate zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in performing faster processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization -- the classic challenge of sparse problems -- their decompression mechanism can potentially create a computation bottleneck. Not only is this challenge not resolved, but it also becomes more serious with the advent of domain-specific architectures (DSAs), as they intend to more aggressively improve performance. The performance implications of using various formats along with DSAs, however, have not been extensively studied by prior work. To fill this gap of knowledge, we characterize the impact of using seven frequently used sparse formats on performance, based on a DSA for sparse matrix-vector multiplication (SpMV), implemented on an FPGA using high-level synthesis (HLS) tools, a growing and popular method for developing DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads.
Submitted 18 October, 2021; v1 submitted 21 November, 2020;
originally announced November 2020.
-
LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition
Authors:
Ramyad Hadidi,
Bahar Asgari,
Jiashen Cao,
Younmin Bae,
Da Eun Shim,
Hyojong Kim,
Sung-Kyu Lim,
Michael S. Ryoo,
Hyesoon Kim
Abstract:
Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs at the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communication is costly, taking advantage of other available edge devices by using data- or model-parallelism methods is not an effective solution. To benefit from available compute resources with low communication overhead, we propose the first DNN parallelization method for reducing the communication overhead in a distributed system. We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches. LCP offers close-to-minimum communication overhead with better distribution and parallelization opportunities while significantly reducing memory footprint and computation compared to data- and model-parallelism methods. We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also evaluate the performance of LCP models on customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16 mW, 0.107 mm² ASIC at 7 nm. LCP models achieve maximum and average speedups of 56x and 7x, compared to the originals, which could be improved by up to an average speedup of 33x by incorporating common optimizations such as pruning and quantization.
Submitted 17 November, 2020; v1 submitted 13 March, 2020;
originally announced March 2020.
-
Memory Slices: A Modular Building Block for Scalable, Intelligent Memory Systems
Authors:
Bahar Asgari,
Saibal Mukhopadhyay,
Sudhakar Yalamanchili
Abstract:
While reduction in feature size makes computation cheaper in terms of latency, area, and power consumption, performance of emerging data-intensive applications is determined by data movement. These trends have introduced the concept of scalability as reaching a desirable performance per unit cost by using as few units as possible. Many proposals have moved compute closer to the memory. However, these efforts ignored maintaining a balance between the bandwidth and compute rate of an architecture and those of applications, which is a key principle in designing scalable large systems. This paper proposes the use of memory slices, a modular building block for scalable memory systems integrated with compute, in which performance scales with memory size (and volume of data). The slice architecture utilizes a programmable memory interface feeding a systolic compute engine with high reuse rate. The modularity feature of slice-based systems is exploited with a partitioning and data mapping strategy across allocated memory slices where training performance scales with the data size. These features enable shifting most of the pressure to cheap compute units rather than expensive memory accesses or transfers via the interconnection network. An application of the memory slices to a scale-out memory system is accelerating the training of recurrent, convolutional, and hybrid neural networks (RNNs and RNNs+CNN) that are forming cloud workloads. The results of our cycle-level simulations show that memory slices exhibit a superlinear speedup when the number of slices increases. Furthermore, memory slices improve power efficiency to 747 GFLOPs/J for training LSTMs. While our current evaluation uses memory slices with 3D packaging, a major value is that slices can also be constructed with a variety of packaging options, for example with DDR-based memory units.
Submitted 15 March, 2018;
originally announced March 2018.
-
Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube
Authors:
Ramyad Hadidi,
Bahar Asgari,
Jeffrey Young,
Burhan Ahmad Mudassar,
Kartikay Garg,
Tushar Krishna,
Hyesoon Kim
Abstract:
Memories that exploit three-dimensional (3D)-stacking technology, which integrate memory and logic dies in a single stack, are becoming popular. These memories, such as Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC) design for connecting their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits such as high bandwidth and memory-level parallelism. However, the implications of NoCs on the characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap by (i) characterizing an HMC prototype on the AC-510 accelerator board and revealing its access latency behaviors, and (ii) investigating the implications of such behaviors on system and software designs.
Submitted 13 February, 2018; v1 submitted 17 July, 2017;
originally announced July 2017.
-
Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for Hybrid Memory Cube
Authors:
Ramyad Hadidi,
Bahar Asgari,
Burhan Ahmad Mudassar,
Saibal Mukhopadhyay,
Sudhakar Yalamanchili,
Hyesoon Kim
Abstract:
Three-dimensional (3D)-stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally associated with memories. A practical 3D-stacked memory is Hybrid Memory Cube (HMC), which provides significant access bandwidth and low power consumption in a small area. Although several studies have taken advantage of the novel architecture of HMC, its characteristics in terms of latency and bandwidth or their correlation with temperature and power consumption have not been fully explored. This paper is the first, to the best of our knowledge, to characterize the thermal behavior of HMC in a real environment using the AC-510 accelerator and to identify temperature as a new limitation for this state-of-the-art design space. Moreover, besides bandwidth studies, we deconstruct factors that contribute to latency and reveal their sources for high- and low-load accesses. The results of this paper demonstrate essential behaviors and performance bottlenecks for future explorations of packet-switched and 3D-stacked memories.
Submitted 3 October, 2017; v1 submitted 8 June, 2017;
originally announced June 2017.
-
Geometrical Variation Analysis of an Electrothermally Driven Polysilicon Microactuator
Authors:
M. Shamshirsaz,
M. Maroufi,
M. B. Asgari
Abstract:
Analytical models that predict the thermal and mechanical responses of a microactuator have been developed. These models are based on electrothermal and thermomechanical analysis of the microbeam. Also, Finite Element Analysis (FEA) is used to evaluate microactuator tip deflection. Analytical and Finite Element results are compared with experimental results in the literature and show good agreement at low input voltages. Dimensional variations of beam lengths, beam length ratios, and gap are introduced in the analytical and FEA models to explore microactuator performance. An electrothermally driven polysilicon microactuator similar to Pan's actuator architecture has been studied in this paper. This microactuator generates deflection through asymmetric heating of the hot and cold polysilicon arms with different lengths. For this microactuator architecture, an optimal beam length ratio equal to 0.46 has been obtained to achieve maximum tip deflection. As expected, the results show that decreasing the air gap increases microactuator tip deflection. It is also found that for microactuators with longer hot arms, microactuator tip deflections are more sensitive to beam length ratios and air gap.
Submitted 7 May, 2008;
originally announced May 2008.
-
Analysis of polysilicon micro beams buckling with temperature-dependent properties
Authors:
M. Shamshirsaz,
M. Bahrami,
M. B. Asgari,
M. Tayefeh
Abstract:
Suspended electrothermal polysilicon micro beams generate displacements and forces by thermal buckling effects. In previous electro-thermal and thermo-elastic models of suspended polysilicon micro beams, the thermo-mechanical properties of polysilicon have been considered constant over a wide range of temperatures (20-900 degrees C). In reality, the thermo-mechanical properties of polysilicon depend on temperature and change significantly at high temperatures. This paper describes the development and validation of theoretical and Finite Element Models (FEM) that include the temperature dependencies of polysilicon properties such as thermal expansion coefficient and Young's modulus. In the theoretical models, two parts, an elastic deflection model and a thermal elastic model of micro beam buckling, have been established and simulated. Also, temperature-dependent buckling of a polysilicon micro beam under high temperature has been modeled by Finite Element Analysis (FEA). Analytical results and numerical results using FEA are compared with experimental data available in the literature. Their reasonable agreement validates the analytical model and FEM. This validation indicates the importance of including temperature dependencies of polysilicon thermo-mechanical properties such as Coefficient of Thermal Expansion (CTE) in the previous models.
Submitted 21 February, 2008;
originally announced February 2008.