Search | arXiv e-print repository

Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference

Authors: Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari

Abstract: The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of Gigabytes in size. Offloading is a popular method to escape this constraint by storing weights of an LLM model to host CPU memory and SSD, then loading each weight to GPU before every use. In our case stud… ▽ More The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of Gigabytes in size. Offloading is a popular method to escape this constraint by storing weights of an LLM model to host CPU memory and SSD, then loading each weight to GPU before every use. In our case study of offloaded inference, we found that due to the low bandwidth between storage devices and GPU, the latency of transferring large model weights from its offloaded location to GPU memory becomes the critical bottleneck with actual compute taking nearly 0% of runtime. To effectively reduce the weight transfer latency, we propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with high compression ratio and low decompression overhead. Endor achieves this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves 2.25x speedup on OPT-66B and 2.37x speedup on Llama2-70B. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 14 pages, 16 figures

arXiv:2404.10689 [pdf, other]

Network architecture search of X-ray based scientific applications

Authors: Adarsha Balaji, Ramyad Hadidi, Gregory Kollmer, Mohammed E. Fouda, Prasanna Balaprakash

Abstract: X-ray and electron diffraction-based microscopy use bragg peak detection and ptychography to perform 3-D imaging at an atomic resolution. Typically, these techniques are implemented using computationally complex tasks such as a Psuedo-Voigt function or solving a complex inverse problem. Recently, the use of deep neural networks has improved the existing state-of-the-art approaches. However, the de… ▽ More X-ray and electron diffraction-based microscopy use bragg peak detection and ptychography to perform 3-D imaging at an atomic resolution. Typically, these techniques are implemented using computationally complex tasks such as a Psuedo-Voigt function or solving a complex inverse problem. Recently, the use of deep neural networks has improved the existing state-of-the-art approaches. However, the design and development of the neural network models depends on time and labor intensive tuning of the model by application experts. To that end, we propose a hyperparameter (HPS) and neural architecture search (NAS) approach to automate the design and optimization of the neural network models for model size, energy consumption and throughput. We demonstrate the improved performance of the auto-tuned models when compared to the manually tuned BraggNN and PtychoNN benchmark. We study and demonstrate the importance of the exploring the search space of tunable hyperparameters in enhancing the performance of bragg peak detection and ptychographic reconstruction. Our NAS and HPS of (1) BraggNN achieves a 31.03\% improvement in bragg peak detection accuracy with a 87.57\% reduction in model size, and (2) PtychoNN achieves a 16.77\% improvement in model accuracy and a 12.82\% reduction in model size when compared to the baseline PtychoNN model. When inferred on the Orin-AGX platform, the optimized Braggnn and Ptychonn models demonstrate a 10.51\% and 9.47\% reduction in inference latency and a 44.18\% and 15.34\% reduction in energy consumption when compared to their respective baselines, when inferred in the Orin-AGX edge platform. △ Less

Submitted 16 April, 2024; originally announced April 2024.

arXiv:2104.04563 [pdf, other]

Context-Aware Task Handling in Resource-Constrained Robots with Virtualization

Authors: Ramyad Hadidi, Nima Shoghi Ghalehshahi, Bahar Asgari, Hyesoon Kim

Abstract: Intelligent mobile robots are critical in several scenarios. However, as their computational resources are limited, mobile robots struggle to handle several tasks concurrently and yet guaranteeing real-timeliness. To address this challenge and improve the real-timeliness of critical tasks under resource constraints, we propose a fast context-aware task handling technique. To effectively handling t… ▽ More Intelligent mobile robots are critical in several scenarios. However, as their computational resources are limited, mobile robots struggle to handle several tasks concurrently and yet guaranteeing real-timeliness. To address this challenge and improve the real-timeliness of critical tasks under resource constraints, we propose a fast context-aware task handling technique. To effectively handling tasks in real-time, our proposed context-aware technique comprises of three main ingredients: (i) a dynamic time-sharing mechanism, coupled with (ii) an event-driven task scheduling using reactive programming paradigm to mindfully use the limited resources; and, (iii) a lightweight virtualized execution to easily integrate functionalities and their dependencies. We showcase our technique on a Raspberry-Pi-based robot with a variety of tasks such as Simultaneous localization and map** (SLAM), sign detection, and speech recognition with a 42% speedup in total execution time compared to the common Linux scheduler. △ Less

Submitted 9 April, 2021; originally announced April 2021.

arXiv:2104.04447 [pdf, other]

Creating Robust Deep Neural Networks With Coded Distributed Computing for IoT Systems

Authors: Ramyad Hadidi, Jiashen Cao, Hyesoon Kim

Abstract: The increasing interest in serverless computation and ubiquitous wireless networks has led to numerous connected devices in our surroundings. Among such devices, IoT devices have access to an abundance of raw data, but their inadequate resources in computing limit their capabilities. Specifically, with the emergence of deep neural networks (DNNs), not only is the demand for the computing power of… ▽ More The increasing interest in serverless computation and ubiquitous wireless networks has led to numerous connected devices in our surroundings. Among such devices, IoT devices have access to an abundance of raw data, but their inadequate resources in computing limit their capabilities. Specifically, with the emergence of deep neural networks (DNNs), not only is the demand for the computing power of IoT devices increasing but also privacy concerns are pushing computations towards the edge. To overcome inadequate resources, several studies have proposed the distribution of work among IoT devices. These promising methods harvest the aggregated computing power of the idle IoT devices in an environment. However, since such a distributed system strongly relies on each device, unstable latencies, and intermittent failures, the common characteristics of IoT devices and wireless networks, cause high recovery overheads. To reduce this overhead, we propose a novel robustness method with a close-to-zero recovery latency for DNN computations. Our solution never loses a request or spends time recovering from a failure. To do so, first, we analyze the underlying matrix-matrix computations affected by distribution. Then, we introduce a new coded distributed computing (CDC) method that has a constant cost with the increasing number of devices, unlike the linear cost of modular redundancies. Moreover, our method is applied in the library level, without requiring extensive changes to the program, while still ensuring a balanced work assignment during distribution. To illustrate our method, we perform experiments with distributed systems comprising up to 12 Raspberry Pis. △ Less

Submitted 9 April, 2021; originally announced April 2021.

arXiv:2102.08481 [pdf, other]

THIA: Accelerating Video Analytics using Early Inference and Fine-Grained Query Planning

Authors: Jiashen Cao, Ramyad Hadidi, Joy Arulraj, Hyesoon Kim

Abstract: To efficiently process visual data at scale, researchers have proposed two techniques for lowering the computational overhead associated with the underlying deep learning models. The first approach consists of leveraging a specialized, lightweight model to directly answer the query. The second approach focuses on filtering irrelevant frames using a lightweight model and processing the filtered fra… ▽ More To efficiently process visual data at scale, researchers have proposed two techniques for lowering the computational overhead associated with the underlying deep learning models. The first approach consists of leveraging a specialized, lightweight model to directly answer the query. The second approach focuses on filtering irrelevant frames using a lightweight model and processing the filtered frames using a heavyweight model. These techniques suffer from two limitations. With the first approach, the specialized model is unable to provide accurate results for hard-to-detect events. With the second approach, the system is unable to accelerate queries focusing on frequently occurring events as the filter is unable to eliminate a significant fraction of frames in the video. In this paper, we present THIA, a video analytics system for tackling these limitations. The design of THIA is centered around three techniques. First, instead of using a cascade of models, it uses a single object detection model with multiple exit points for short-circuiting the inference. This early inference technique allows it to support a range of throughput-accuracy tradeoffs. Second, it adopts a fine-grained approach to planning and processes different chunks of the video using different exit points to meet the user's requirements. Lastly, it uses a lightweight technique for directly estimating the exit point for a chunk to lower the optimization time. We empirically show that these techniques enable THIA to outperform two state-of-the-art video analytics systems by up to 6.5X, while providing accurate results even on queries focusing on hard-to-detect events. △ Less

Submitted 16 February, 2021; originally announced February 2021.

arXiv:2011.10932 [pdf, other]

doi 10.1109/IISWC53511.2021.00012

Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads

Authors: Bahar Asgari, Ramyad Hadidi, Joshua Dierberger, Charlotte Steinichen, Amaan Marfatia, Hyesoon Kim

Abstract: Sparse matrices are the key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to significantly eliminate zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in p… ▽ More Sparse matrices are the key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to significantly eliminate zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in performing faster processing. In other words, although they allow faster data transfer and improve memory bandwidth utilization -- the classic challenge of sparse problems -- their decompression mechanism can potentially create a computation bottleneck. Not only is this challenge not resolved, but also it becomes more serious with the advent of domain-specific architectures (DSAs), as they intend to more aggressively improve performance. The performance implications of using various formats along with DSAs, however, has not been extensively studied by prior work. To fill this gap of knowledge, we characterize the impact of using seven frequently used sparse formats on performance, based on a DSA for sparse matrix-vector multiplication (SpMV), implemented on an FPGA using high-level synthesis (HLS) tools, a growing and popular method for develo** DSAs. Seeking a fair comparison, we tailor and optimize the HLS implementation of decompression for each format. We thoroughly explore diverse metrics, including decompression overhead, latency, balance ratio, throughput, memory bandwidth utilization, resource utilization, and power consumption, on a variety of real-world and synthetic sparse workloads. △ Less

Submitted 18 October, 2021; v1 submitted 21 November, 2020; originally announced November 2020.

Comments: 11 pages, 14 figures, 2 tables

arXiv:2011.08936 [pdf, other]

Secure Location-Aware Authentication and Communication for Intelligent Transportation Systems

Authors: Nima Shoghi Ghalehshahi, Ramyad Hadidi, Lee Jaewon, Jun Chen, Arthur Siqueria, Rahul Rajan, Shaan Dhawan, Pooya Shoghi Ghalehshahi, Hyesoon Kim

Abstract: Intelligent transportation systems (ITS) are expected to effectively create a stand-alone network for secure communication among autonomous agents. In such a dynamic and fast-changing network with high-speed agents, verifying the authenticity and integrity of messages while taking preventive action (e.g., applying brakes) within tens of milliseconds is one of the main challenges. In such a brief m… ▽ More Intelligent transportation systems (ITS) are expected to effectively create a stand-alone network for secure communication among autonomous agents. In such a dynamic and fast-changing network with high-speed agents, verifying the authenticity and integrity of messages while taking preventive action (e.g., applying brakes) within tens of milliseconds is one of the main challenges. In such a brief moment after receiving a message, the agent not only must verify the integrity and authenticity of the received message but also needs to perform extra computations to localize the sender of the message for taking appropriate action (e.g., an immediate stop warning from a vehicle in front vs. rear). In this paper, we present an inherently location-aware and lightweight authentication protocol by exploiting in situ visual localization (i.e., SLAM). In this protocol, each agent displays its public key using visual authentication beacons (e.g., QR codes). Thus, receiving agents not only can verify and authenticate the messages but also can easily localize the sender by kee** a shortlist of observed visual beacons within their visual localization system with no additional computation cost. Compared to prior work, our location-aware protocol is scalable, does not depend on any infrastructure, removes the high cost of post-message-delivery localization, and provides trustworthiness guarantees for information that are beyond the reach of each agent sensors. △ Less

Submitted 17 November, 2020; originally announced November 2020.

arXiv:2011.07092 [pdf, other]

Reducing Inference Latency with Concurrent Architectures for Image Recognition

Authors: Ramyad Hadidi, Jiashen Cao, Michael S. Ryoo, Hyesoon Kim

Abstract: Satisfying the high computation demand of modern deep learning architectures is challenging for achieving low inference latency. The current approaches in decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with a higher concurrency (i.e., simultaneous execution of one in… ▽ More Satisfying the high computation demand of modern deep learning architectures is challenging for achieving low inference latency. The current approaches in decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with a higher concurrency (i.e., simultaneous execution of one inference among devices). Such single-chain dependencies are so widespread that even implicitly biases recent neural architecture search (NAS) studies. In this visionary paper, we draw attention to an entirely new space of NAS that relaxes the single-chain dependency to provide higher concurrency and distribution opportunities. To quantitatively compare these architectures, we propose a score that encapsulates crucial metrics such as communication, concurrency, and load balancing. Additionally, we propose a new generator and transformation block that consistently deliver superior architectures compared to current state-of-the-art methods. Finally, our preliminary results show that these new architectures reduce the inference latency and deserve more attention. △ Less

Submitted 13 November, 2020; originally announced November 2020.

arXiv:2003.06464 [pdf, other]

LCP: A Low-Communication Parallelization Method for Fast Neural Network Inference in Image Recognition

Authors: Ramyad Hadidi, Bahar Asgari, Jiashen Cao, Younmin Bae, Da Eun Shim, Hyojong Kim, Sung-Kyu Lim, Michael S. Ryoo, Hyesoon Kim

Abstract: Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs in the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communic… ▽ More Deep neural networks (DNNs) have inspired new studies in myriad edge applications with robots, autonomous agents, and Internet-of-things (IoT) devices. However, performing inference of DNNs in the edge is still a severe challenge, mainly because of the contradiction between the intensive resource requirements of DNNs and the tight resource availability in several edge domains. Further, as communication is costly, taking advantage of other available edge devices by using data- or model-parallelism methods is not an effective solution. To benefit from available compute resources with low communication overhead, we propose the first DNN parallelization method for reducing the communication overhead in a distributed system. We propose a low-communication parallelization (LCP) method in which models consist of several almost-independent and narrow branches. LCP offers close-to-minimum communication overhead with better distribution and parallelization opportunities while significantly reducing memory footprint and computation compared to data- and model-parallelism methods. We deploy LCP models on three distributed systems: AWS instances, Raspberry Pis, and PYNQ boards. We also evaluate the performance of LCP models on a customized hardware (tailored for low latency) implemented on a small edge FPGA and as a 16mW 0.107mm2 ASIC @7nm chip. LCP models achieve a maximum and average speedups of 56x and 7x, compared to the originals, which could be improved by up to an average speedup of 33x by incorporating common optimizations such as pruning and quantization. △ Less

Submitted 17 November, 2020; v1 submitted 13 March, 2020; originally announced March 2020.

arXiv:1901.02537 [pdf, other]

Collaborative Execution of Deep Neural Networks on Internet of Things Devices

Authors: Ramyad Hadidi, Jiashen Cao, Micheal S. Ryoo, Hyesoon Kim

Abstract: With recent advancements in deep neural networks (DNNs), we are able to solve traditionally challenging problems. Since DNNs are compute intensive, consumers, to deploy a service, need to rely on expensive and scarce compute resources in the cloud. This approach, in addition to its dependability on high-quality network infrastructure and data centers, raises new privacy concerns. These challenges… ▽ More With recent advancements in deep neural networks (DNNs), we are able to solve traditionally challenging problems. Since DNNs are compute intensive, consumers, to deploy a service, need to rely on expensive and scarce compute resources in the cloud. This approach, in addition to its dependability on high-quality network infrastructure and data centers, raises new privacy concerns. These challenges may limit DNN-based applications, so many researchers have tried optimize DNNs for local and in-edge execution. However, inadequate power and computing resources of edge devices along with small number of requests limits current optimizations applicability, such as batch processing. In this paper, we propose an approach that utilizes aggregated existing computing power of Internet of Things (IoT) devices surrounding an environment by creating a collaborative network. In this approach, IoT devices cooperate to conduct single-batch inferencing in real time. While exploiting several new model-parallelism methods and their distribution characteristics, our approach enhances the collaborative network by creating a balanced and distributed processing pipeline. We have illustrated our work using many Raspberry Pis with studying DNN models such as AlexNet, VGG16, Xception, and C3D. △ Less

Submitted 8 January, 2019; originally announced January 2019.

Comments: Updated version after sysML

arXiv:1802.02138 [pdf, other]

Musical Chair: Efficient Real-Time Recognition Using Collaborative IoT Devices

Authors: Ramyad Hadidi, Jiashen Cao, Matthew Woodward, Michael S. Ryoo, Hyesoon Kim

Abstract: The prevalence of Internet of things (IoT) devices and abundance of sensor data has created an increase in real-time data processing such as recognition of speech, image, and video. While currently such processes are offloaded to the computationally powerful cloud system, a localized and distributed approach is desirable because (i) it preserves the privacy of users and (ii) it omits the dependenc… ▽ More The prevalence of Internet of things (IoT) devices and abundance of sensor data has created an increase in real-time data processing such as recognition of speech, image, and video. While currently such processes are offloaded to the computationally powerful cloud system, a localized and distributed approach is desirable because (i) it preserves the privacy of users and (ii) it omits the dependency on cloud services. However, IoT networks are usually composed of resource-constrained devices, and a single device is not powerful enough to process real-time data. To overcome this challenge, we examine data and model parallelism for such devices in the context of deep neural networks. We propose Musical Chair to enable efficient, localized, and dynamic real-time recognition by harvesting the aggregated computational power from the resource-constrained devices in the same IoT network as input sensors. Musical chair adapts to the availability of computing devices at runtime and adjusts to the inherit dynamics of IoT networks. To demonstrate Musical Chair, on a network of Raspberry PIs (up to 12) each connected to a camera, we implement a state-of-the-art action recognition model for videos and two recognition models for images. Compared to the Tegra TX2, an embedded low-power platform with a six-core CPU and a GPU, our distributed action recognition system achieves not only similar energy consumption but also twice the performance of the TX2. Furthermore, in image recognition, Musical Chair achieves similar performance and saves dynamic energy. △ Less

Submitted 21 March, 2018; v1 submitted 5 February, 2018; originally announced February 2018.

arXiv:1710.09517 [pdf, other]

doi 10.1145/3232521

CODA: Enabling Co-location of Computation and Data for Near-Data Processing

Authors: Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel H. Loh

Abstract: Recent studies have demonstrated that near-data processing (NDP) is an effective technique for improving performance and energy efficiency of data-intensive workloads. However, leveraging NDP in realistic systems with multiple memory modules introduces a new challenge. In today's systems, where no computation occurs in memory modules, the physical address space is interleaved at a fine granularity… ▽ More Recent studies have demonstrated that near-data processing (NDP) is an effective technique for improving performance and energy efficiency of data-intensive workloads. However, leveraging NDP in realistic systems with multiple memory modules introduces a new challenge. In today's systems, where no computation occurs in memory modules, the physical address space is interleaved at a fine granularity among all memory modules to help improve the utilization of processor-memory interfaces by distributing the memory traffic. However, this is at odds with efficient use of NDP, which requires careful placement of data in memory modules such that near-data computations and their exclusively used data can be localized in individual memory modules, while distributing shared data among memory modules to reduce hotspots. In order to address this new challenge, we propose a set of techniques that (1) enable collections of OS pages to either be fine-grain interleaved among memory modules (as is done today) or to be placed contiguously on individual memory modules (as is desirable for NDP private data), and (2) decide whether to localize or distribute each memory object based on its anticipated access pattern and steer computations to the memory where the data they access is located. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces 38% remote data accesses over a baseline system that cannot exploit computate-data affinity characteristics. △ Less

Submitted 25 October, 2017; originally announced October 2017.

Comments: 14 pages, 16 figures

Journal ref: ACM Transactions on Architecture and Code Optimization (TACO) Volume 15 Issue 3, October 2018 Article No. 32

arXiv:1707.05399 [pdf, other]

doi 10.1109/ISPASS.2018.00018

Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube

Authors: Ramyad Hadidi, Bahar Asgari, Jeffrey Young, Burhan Ahmad Mudassar, Kartikay Garg, Tushar Krishna, Hyesoon Kim

Abstract: Memories that exploit three-dimensional (3D)-stacking technology, which integrate memory and logic dies in a single stack, are becoming popular. These memories, such as Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC) design for connecting their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits su… ▽ More Memories that exploit three-dimensional (3D)-stacking technology, which integrate memory and logic dies in a single stack, are becoming popular. These memories, such as Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC) design for connecting their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits such as high bandwidth and memory-level parallelism. However, the implications of NoCs on the characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap by (i) characterizing an HMC prototype on the AC-510 accelerator board and revealing its access latency behaviors, and (ii) by investigating the implications of such behaviors on system and software designs. △ Less

Submitted 13 February, 2018; v1 submitted 17 July, 2017; originally announced July 2017.

Journal ref: 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

arXiv:1706.02725 [pdf, other]

doi 10.1109/IISWC.2017.8167757

Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for Hybrid Memory Cube

Authors: Ramyad Hadidi, Bahar Asgari, Burhan Ahmad Mudassar, Saibal Mukhopadhyay, Sudhakar Yalamanchili, Hyesoon Kim

Abstract: Three-dimensional (3D)-stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally associated with memories. A practical 3D-stacked memory is Hybrid Memory Cube (HMC), which provides significant access bandwidth and low power consumption in a small… ▽ More Three-dimensional (3D)-stacking technology, which enables the integration of DRAM and logic dies, offers high bandwidth and low energy consumption. This technology also empowers new memory designs for executing tasks not traditionally associated with memories. A practical 3D-stacked memory is Hybrid Memory Cube (HMC), which provides significant access bandwidth and low power consumption in a small area. Although several studies have taken advantage of the novel architecture of HMC, its characteristics in terms of latency and bandwidth or their correlation with temperature and power consumption have not been fully explored. This paper is the first, to the best of our knowledge, to characterize the thermal behavior of HMC in a real environment using the AC-510 accelerator and to identify temperature as a new limitation for this state-of-the-art design space. Moreover, besides bandwidth studies, we deconstruct factors that contribute to latency and reveal their sources for high- and low-load accesses. The results of this paper demonstrates essential behaviors and performance bottlenecks for future explorations of packet-switched and 3D-stacked memories. △ Less

Submitted 3 October, 2017; v1 submitted 8 June, 2017; originally announced June 2017.

Comments: EEE Catalog Number: CFP17236-USB ISBN 13: 978-1-5386-1232-3

Journal ref: Proceedings of the 2017 IEEE International Symposium on Workload Characterization

Showing 1–14 of 14 results for author: Hadidi, R