Search | arXiv e-print repository

Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference

Authors: Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari

Abstract: The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of gigabytes in size. Offloading is a popular method to escape this constraint by storing the weights of an LLM in host CPU memory and SSD, then loading each weight to the GPU before every use. In our case study of offloaded inference, we found that due to the low bandwidth between storage devices and GPU, the latency of transferring large model weights from their offloaded location to GPU memory becomes the critical bottleneck, with actual compute taking nearly 0% of runtime. To effectively reduce the weight transfer latency, we propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with high compression ratio and low decompression overhead. Endor achieves this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves 2.25x speedup on OPT-66B and 2.37x speedup on Llama2-70B.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 14 pages, 16 figures
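
The bitmap idea mentioned in the Endor abstract can be illustrated with a minimal Python sketch. This is not the paper's actual format: the bit ordering, the byte packing, and the function names `bitmap_compress` and `bitmap_decompress` are assumptions chosen only to show how one bit per element can record the positions of non-zero values.

```python
# Generic bitmap-based sparse encoding (illustrative only, not Endor's layout).
# Assumption: bit i of the bitmap (least significant bit first within each byte)
# is 1 when element i of the flattened weight array is non-zero.

def bitmap_compress(dense):
    """Return (bitmap, values): one position bit per element plus the non-zero values."""
    bitmap = bytearray((len(dense) + 7) // 8)
    values = []
    for i, w in enumerate(dense):
        if w != 0:
            bitmap[i // 8] |= 1 << (i % 8)
            values.append(w)
    return bytes(bitmap), values


def bitmap_decompress(bitmap, values, length):
    """Rebuild the dense list by scattering values to the positions marked in the bitmap."""
    dense = [0.0] * length
    remaining = iter(values)
    for i in range(length):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            dense[i] = next(remaining)
    return dense


weights = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 3.0, 0.0]
bitmap, values = bitmap_compress(weights)
print(list(bitmap), values)  # [18, 1] [1.5, -2.0, 3.0]
print(bitmap_decompress(bitmap, values, len(weights)) == weights)  # True
```

In this sketch the position metadata costs one bit per element regardless of where the non-zeros fall, which is why a bitmap suits unstructured sparsity; the compression ratio then depends on how many values are pruned to zero.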

arXiv:2406.10166 [pdf, other]

Misam: Using ML in Dataflow Selection of Sparse-Sparse Matrix Multiplication

Authors: Sanjali Yadav, Bahar Asgari

Abstract: Sparse matrix-matrix multiplication (SpGEMM) is a critical operation in numerous fields, including scientific computing, graph analytics, and deep learning. These applications exploit the sparsity of matrices to reduce storage and computational demands. However, the irregular structure of sparse matrices poses significant challenges for performance optimization. Traditional hardware accelerators are tailored for specific sparsity patterns with fixed dataflow schemes (inner, outer, and row-wise) but often perform suboptimally when the actual sparsity deviates from these predetermined patterns. As the use of SpGEMM expands across various domains, each with distinct sparsity characteristics, the demand for hardware accelerators that can efficiently handle a range of sparsity patterns is increasing. This paper presents a machine-learning-based approach for adaptively selecting the most appropriate dataflow scheme for SpGEMM tasks with diverse sparsity patterns. By employing decision trees and deep reinforcement learning, we explore the potential of these techniques to surpass heuristic-based methods in identifying optimal dataflow schemes. We evaluate our models by comparing their performance with that of a heuristic, highlighting the strengths and weaknesses of each approach. Our findings suggest that using machine learning for dynamic dataflow selection in hardware accelerators can provide gains of up to 28 times.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted to ISCA 2024 MLArchSys workshop https://openreview.net/forum?id=A1V9FaZRbV
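
The three dataflow schemes named in the Misam abstract (inner, outer, and row-wise) differ only in loop order, which a small software sketch can make concrete. The code below is an illustration under stated assumptions, not the paper's accelerator or selection model: matrices are stored as nested dictionaries `{row: {col: value}}`, and the function names are hypothetical.

```python
from collections import defaultdict

# Sparse matrices are represented as {row: {col: value}} (an assumption for this sketch).

def transpose(m):
    """Return the column-major view {col: {row: value}} of a sparse matrix."""
    t = defaultdict(dict)
    for r, row in m.items():
        for c, v in row.items():
            t[c][r] = v
    return t


def spgemm_inner(a, b):
    """Inner product: each output C[i][j] is a dot product of row i of A and column j of B."""
    b_cols = transpose(b)
    c = defaultdict(dict)
    for i, a_row in a.items():
        for j, b_col in b_cols.items():
            shared = a_row.keys() & b_col.keys()
            if shared:
                c[i][j] = sum(a_row[k] * b_col[k] for k in shared)
    return {i: dict(row) for i, row in c.items()}


def spgemm_outer(a, b):
    """Outer product: column k of A times row k of B yields a partial matrix to accumulate."""
    c = defaultdict(lambda: defaultdict(int))
    for k, a_col in transpose(a).items():
        for i, av in a_col.items():
            for j, bv in b.get(k, {}).items():
                c[i][j] += av * bv
    return {i: dict(row) for i, row in c.items()}


def spgemm_row_wise(a, b):
    """Row-wise: row i of C is a sum of rows of B scaled by the non-zeros in row i of A."""
    c = {}
    for i, a_row in a.items():
        acc = defaultdict(int)
        for k, av in a_row.items():
            for j, bv in b.get(k, {}).items():
                acc[j] += av * bv
        if acc:
            c[i] = dict(acc)
    return c


a = {0: {0: 1, 2: 2}, 1: {1: 3}}
b = {0: {1: 4}, 1: {0: 5}, 2: {1: 6}}
print(spgemm_inner(a, b))  # {0: {1: 16}, 1: {0: 15}}
print(spgemm_inner(a, b) == spgemm_outer(a, b) == spgemm_row_wise(a, b))  # True
```

All three functions compute the same product; what changes is which operand is traversed contiguously and how many partial results must be merged, which is the kind of trade-off a dataflow selector has to weigh against a matrix's sparsity pattern.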

arXiv:2104.04563 [pdf, other]

Context-Aware Task Handling in Resource-Constrained Robots with Virtualization

Authors: Ramyad Hadidi, Nima Shoghi Ghalehshahi, Bahar Asgari, Hyesoon Kim

Abstract: Intelligent mobile robots are critical in several scenarios. However, as their computational resources are limited, mobile robots struggle to handle several tasks concurrently while still guaranteeing real-timeliness. To address this challenge and improve the real-timeliness of critical tasks under resource constraints, we propose a fast context-aware task handling technique. To effectively handle tasks in real-time, our proposed context-aware technique comprises three main ingredients: (i) a dynamic time-sharing mechanism, coupled with (ii) an event-driven task scheduling using a reactive programming paradigm to mindfully use the limited resources; and, (iii) a lightweight virtualized execution to easily integrate functionalities and their dependencies. We showcase our technique on a Raspberry-Pi-based robot with a variety of tasks such as simultaneous localization and mapping (SLAM), sign detection, and speech recognition with a 42% speedup in total execution time compared to the common Linux scheduler.

Submitted 9 April, 2021; originally announced April 2021.

arXiv:2011.10932 [pdf, other]

doi 10.1109/IISWC53511.2021.00012

Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads

Authors: Bahar Asgari, Ramyad Hadidi, Joshua Dierberger, Charlotte Steinichen, Amaan Marfatia, Hyesoon Kim

Abstract: Sparse matrices are the key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to significantly eliminate zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in performing faster processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization -- the classic challenge of sparse problems -- their decompression mechanism can potentially create a computation bottleneck. Not only is this challenge not resolved, but also it becomes more serious with the advent of domain-specific architectures (DSAs), as they intend to more aggressively improve performance. The performance implications of using various formats along with DSAs, however, have not been extensively studied by prior work. To fill this gap of knowledge, we characterize the impact of using seven frequently used sparse formats on performance, based on a DSA for sparse matrix-vector multiplication (SpMV), implemented on an FPGA using high-level synthesis (HLS) tools, a growing and popular method for developing DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads.

Submitted 18 October, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: 11 pages, 14 figures, 2 tables

arXiv:2003.06464 [pdf, other]

LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition

Authors: Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim, Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim

Abstract: Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs at the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communication is costly, taking advantage of other available edge devices by using data- or model-parallelism methods is not an effective solution. To benefit from available compute resources with low communication overhead, we propose the first DNN parallelization method for reducing the communication overhead in a distributed system. We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches. LCP offers close-to-minimum communication overhead with better distribution and parallelization opportunities while significantly reducing memory footprint and computation compared to data- and model-parallelism methods. We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also evaluate the performance of LCP models on customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16mW 0.107mm² ASIC @7nm chip. LCP models achieve maximum and average speedups of 56x and 7x, compared to the originals, which could be improved by up to an average speedup of 33x by incorporating common optimizations such as pruning and quantization.

Submitted 17 November, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

arXiv:1803.06068 [pdf, other]

Memory Slices: A Modular Building Block for Scalable, Intelligent Memory Systems

Authors: Bahar Asgari, Saibal Mukhopadhyay, Sudhakar Yalamanchili

Abstract: While reduction in feature size makes computation cheaper in terms of latency, area, and power consumption, performance of emerging data-intensive applications is determined by data movement. These trends have introduced the concept of scalability as reaching a desirable performance per unit cost by using as few units as possible. Many proposals have moved compute closer to the memory. However, these efforts ignored maintaining a balance between bandwidth and compute rate of an architecture, with those of applications, which is a key principle in designing scalable large systems. This paper proposes the use of memory slices, a modular building block for scalable memory systems integrated with compute, in which performance scales with memory size (and volume of data). The slice architecture utilizes a programmable memory interface feeding a systolic compute engine with high reuse rate. The modularity feature of slice-based systems is exploited with a partitioning and data mapping strategy across allocated memory slices where training performance scales with the data size. These features enable shifting most of the pressure to cheap compute units rather than expensive memory accesses or transfers via the interconnection network. An application of the memory slices to a scale-out memory system is accelerating the training of recurrent, convolutional, and hybrid neural networks (RNNs and RNNs+CNN) that are forming cloud workloads. The results of our cycle-level simulations show that memory slices exhibit a superlinear speedup when the number of slices increases. Furthermore, memory slices improve power efficiency to 747 GFLOPs/J for training LSTMs. While our current evaluation uses memory slices with 3D packaging, a major value is that slices can also be constructed with a variety of packaging options, for example with DDR-based memory units.

Submitted 15 March, 2018; originally announced March 2018.

arXiv:1707.05399 [pdf, other]

doi 10.1109/ISPASS.2018.00018

Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube

Authors: Ramyad Hadidi, Bahar Asgari, Jeffrey Young, Burhan Ahmad Mudassar, Kartikay Garg, Tushar Krishna, Hyesoon Kim

Abstract: Memories that exploit three-dimensional (3D)-stacking technology, which integrates memory and logic dies in a single stack, are becoming popular. These memories, such as Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC) design for connecting their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits such as high bandwidth and memory-level parallelism. However, the implications of NoCs on the characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap by (i) characterizing an HMC prototype on the AC-510 accelerator board and revealing its access latency behaviors, and (ii) investigating the implications of such behaviors on system and software designs.

Submitted 13 February, 2018; v1 submitted 17 July, 2017; originally announced July 2017.

Journal ref: 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

arXiv:1706.02725 [pdf, other]

doi 10.1109/IISWC.2017.8167757

Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for Hybrid Memory Cube

Authors: Ramyad Hadidi, Bahar Asgari, Burhan Ahmad Mudassar, Saibal Mukhopadhyay, Sudhakar Yalamanchili, Hyesoon Kim

Abstract: Three-dimensional (3D)-stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally associated with memories. A practical 3D-stacked memory is Hybrid Memory Cube (HMC), which provides significant access bandwidth and low power consumption in a small area. Although several studies have taken advantage of the novel architecture of HMC, its characteristics in terms of latency and bandwidth or their correlation with temperature and power consumption have not been fully explored. This paper is the first, to the best of our knowledge, to characterize the thermal behavior of HMC in a real environment using the AC-510 accelerator and to identify temperature as a new limitation for this state-of-the-art design space. Moreover, besides bandwidth studies, we deconstruct factors that contribute to latency and reveal their sources for high- and low-load accesses. The results of this paper demonstrate essential behaviors and performance bottlenecks for future explorations of packet-switched and 3D-stacked memories.

Submitted 3 October, 2017; v1 submitted 8 June, 2017; originally announced June 2017.

Comments: IEEE Catalog Number: CFP17236-USB ISBN 13: 978-1-5386-1232-3

Journal ref: Proceedings of the 2017 IEEE International Symposium on Workload Characterization

arXiv:0805.0888 [pdf]

Geometrical Variation Analysis of an Electrothermally Driven Polysilicon Microactuator

Authors: M. Shamshirsaz, M. Maroufi, M. B. Asgari

Abstract: Analytical models that predict the thermal and mechanical responses of a microactuator have been developed. These models are based on electrothermal and thermomechanical analysis of the microbeam. Also, Finite Element Analysis (FEA) is used to evaluate microactuator tip deflection. Analytical and Finite Element results are compared with experimental results in the literature and show good agreement at low input voltages. A dimensional variation of beam lengths, beam length ratios, and gap is introduced in the analytical and FEA models to explore microactuator performance. An electrothermally driven polysilicon microactuator similar to Pan's actuator architecture has been studied in this paper. This microactuator generates deflection through asymmetric heating of the hot and cold polysilicon arms with different lengths. For this microactuator architecture, an optimal beam length ratio equal to 0.46 has been obtained to achieve maximum tip deflection. As expected, the results show that decreasing the air gap increases microactuator tip deflection. It is also found that for microactuators with longer hot arms, microactuator tip deflections are more sensitive to beam length ratios and air gap.

Submitted 7 May, 2008; originally announced May 2008.

Comments: Submitted on behalf of EDA Publishing Association (http://irevues.inist.fr/handle/2042/16838)

Journal ref: In Symposium on Design, Test, Integration and Packaging of MEMS/MOEMS - DTIP 2008, Nice, France (2008)

arXiv:0802.3054 [pdf]

Analysis of polysilicon micro beams buckling with temperature-dependent properties

Authors: M. Shamshirsaz, M. Bahrami, M. B. Asgari, M. Tayefeh

Abstract: Suspended electrothermal polysilicon micro beams generate displacements and forces by thermal buckling effects. In previous electro-thermal and thermo-elastic models of suspended polysilicon micro beams, the thermo-mechanical properties of polysilicon have been considered constant over a wide range of temperatures (20-900 degrees C). In reality, the thermo-mechanical properties of polysilicon depend on temperature and change significantly at high temperatures. This paper describes the development and validation of a theoretical model and a Finite Element Model (FEM) that include the temperature dependencies of polysilicon properties such as thermal expansion coefficient and Young's modulus. In the theoretical models, two parts, an elastic deflection model and a thermal elastic model of micro beam buckling, have been established and simulated. Also, temperature-dependent buckling of a polysilicon micro beam under high temperature has been modeled by Finite Element Analysis (FEA). Analytical results and numerical results using FEA are compared with experimental data available in the literature. Their reasonable agreement validates the analytical model and the FEM. This validation indicates the importance of including temperature dependencies of polysilicon thermo-mechanical properties such as Coefficient of Thermal Expansion (CTE) in the previous models.

Submitted 21 February, 2008; originally announced February 2008.

Comments: Submitted on behalf of EDA Publishing Association (http://irevues.inist.fr/EDA-Publishing)

Journal ref: In Symposium on Design, Test, Integration and Packaging of MEMS/MOEMS - DTIP 2007, Stresa, Lago Maggiore, Italy (2007)

Showing 1–10 of 10 results for author: Asgari, B