Search | arXiv e-print repository

arXiv:2403.20297 [pdf, other]

Balanced Data Placement for GEMV Acceleration with Processing-In-Memory

Authors: Mohamed Assem Ibrahim, Mahzabeen Islam, Shaizeen Aga

Abstract: With unprecedented demand for generative AI (GenAI) inference, acceleration of primitives that dominate GenAI such as general matrix-vector multiplication (GEMV) is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain bandwidth boost over processor via augmenting memory banks with compute capabilities and broadcasting same command to all banks. While proposed PIM designs stand to accelerate GEMV, we observe in this work that a key impediment to truly harness PIM acceleration is deducing optimal data-placement to place the matrix in memory banks. To this end, we tease out several factors that impact data-placement and propose PIMnast methodology which, like a gymnast, balances these factors to identify data-placements that deliver GEMV acceleration. Across a spectrum of GenAI models, our proposed PIMnast methodology along with additional orchestration knobs we identify delivers up to 6.86$\times$ speedup for GEMVs (of the available 7$\times$ roofline speedup) leading to up to 5$\times$ speedup for per-token latencies.

Submitted 1 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

arXiv:2401.16677 [pdf, other]

T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

Abstract: Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3 which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer's compute, and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention, and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in $\sim$500-billion parameter models, PALM and MT-NLG.

Submitted 29 January, 2024; originally announced January 2024.

Comments: To appear at the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2024

ACM Class: C.2.4; C.1.2

arXiv:2311.05034 [pdf, other]

Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

Authors: Mohamed Assem Ibrahim, Shaizeen Aga, Ada Li, Suchita Pati, Mahzabeen Islam

Abstract: Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6, etc.) multiple low-precision weight copies can be required. To lower memory capacity needs of weights, we explore just-in-time quantization (JIT-Q) where we only store high-precision weights in memory and generate low-precision weights only when needed. To perform JIT-Q efficiently, in this work, we evaluate emerging processing-in-memory (PIM) technology to execute quantization. With PIM, we can offload quantization to in-memory compute units enabling quantization to be performed without incurring costly data movement while allowing quantization to be concurrent with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24%) at marginal throughput loss (up to 2.4%). Said memory capacity savings can unlock several benefits such as fitting larger model in the same system, reducing model parallelism requirement, and improving overall ML training efficiency.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2309.07984 [pdf, other]

Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures

Authors: Johnathan Alsop, Shaizeen Aga, Mohamed Ibrahim, Mahzabeen Islam, Andrew Mccrabb, Nuwan Jayasena

Abstract: Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such proposals are reasonable given the growing importance of ML, as memory is a pervasive component, there is a case for a more inclusive PIM design that can accelerate primitives across domains. In this work, we ascertain the capabilities of commercial PIM proposals to accelerate various primitives across domains. We first begin with outlining a set of characteristics, termed PIM-amenability-test, which aid in assessing if a given primitive is likely to be accelerated by PIM. Next, we apply this test to primitives under study to ascertain efficient data-placement and orchestration to map the primitives to underlying PIM architecture. We observe here that, even though primitives under study are largely PIM-amenable, existing commercial PIM proposals do not realize their performance potential for these primitives. To address this, we identify bottlenecks that arise in PIM execution and propose hardware and software optimizations which stand to broaden the acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we believe emerging commercial PIM proposals add a necessary and complementary design point in the application acceleration space, hardware-software co-design is necessary to deliver their benefits broadly.

Submitted 17 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

arXiv:2308.03973 [pdf, other]

Collaborative Acceleration for FFT on Commercial Processing-In-Memory Architectures

Authors: Mohamed Assem Ibrahim, Shaizeen Aga

Abstract: This paper evaluates the efficacy of recent commercial processing-in-memory (PIM) solutions to accelerate fast Fourier transform (FFT), an important primitive across several domains. Specifically, we observe that efficient implementations of FFT on modern GPUs are memory bandwidth bound. As such, the memory bandwidth boost availed by commercial PIM solutions makes a case for PIM to accelerate FFT. To this end, we first deduce a mapping of FFT computation to a strawman PIM architecture representative of recent commercial designs. We observe that even with careful data mapping, PIM is not effective in accelerating FFT. To address this, we make a case for collaborative acceleration of FFT with PIM and GPU. Further, we propose software and hardware innovations which lower PIM operations necessary for a given FFT. Overall, our optimized PIM FFT mapping, termed Pimacolaba, delivers performance and data movement savings of up to 1.38$\times$ and 2.76$\times$, respectively, over a range of FFT sizes.

Submitted 7 August, 2023; originally announced August 2023.

arXiv:2304.09411 [pdf, other]

Egalitarian ORAM: Wear-Leveling for ORAM

Authors: Yi Zheng, Aasheesh Kolli, Shaizeen Aga

Abstract: While non-volatile memories (NVMs) provide several desirable characteristics like better density and comparable energy efficiency than DRAM, DRAM-like performance, and disk-like durability, the limited endurance NVMs manifest remains a challenge with these memories. Indeed, the endurance constraints of NVMs can prevent solutions that are commonly employed for other mainstream memories like DRAM from being carried over as-is to NVMs. Specifically, in this work we observe that, Oblivious RAM (ORAM) primitive, the state-of-the-art solution to tackle memory bus side channel vulnerability, while widely studied for DRAMs, is particularly challenging to implement as-is for NVMs as it severely affects endurance of NVMs. This is so, as the inherent nature of ORAM primitive causes an order of magnitude increase in write traffic and furthermore, causes some regions of memory to be written far more often than others. This non-uniform write traffic as manifested by ORAM primitive stands to severely affect the lifetime of non-volatile memories (1% of baseline without ORAM) to even make it impractical to address this security vulnerability.

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2302.02825 [pdf]

Computation vs. Communication Scaling for Future Transformers on Future Hardware

Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

Abstract: Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is important to understand how will compute and communication scale relative to one another as models scale and hardware evolves? A careful study which answers this question can better guide the design of future systems which can efficiently train future large models. Accordingly, this work provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication (Comp-vs.-Comm) scaling for future Transformer models on future hardware. First, our algorithmic analysis shows that compute generally enjoys an edge over communication as models scale. However, since memory capacity scales slower than compute, these trends are being stressed. Next, we quantify this edge by empirically studying how Comp-vs.-Comm scales for future models on future hardware. To avoid profiling numerous Transformer models across many setups, we extract execution regions and project costs using operator models. This allows a spectrum (hundreds) of future model/hardware scenarios to be accurately studied ($<$15% error), and reduces profiling costs by 2100$\times$. Our experiments show that communication will be a significant portion (40-75%) of runtime as models and hardware evolve. Moreover, communication which is hidden by overlapped computation in today's models often cannot be hidden in future, larger models. Overall, this work highlights the increasingly large role communication will play as models scale and discusses techniques and upcoming technologies that can help address it.

Submitted 2 May, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

ACM Class: C.4; C.2.4

arXiv:2104.08335 [pdf]

Demystifying BERT: Implications for Accelerator Design

Authors: Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair

Abstract: Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging language problems. Consequently, these applications are driving the requirements of future systems. Thus, we focus on BERT, one of the most popular NLP transfer learning algorithms, to identify how its algorithmic behavior can guide future accelerator design. To this end, we carefully profile BERT training and identify key algorithmic behaviors which are worthy of attention in accelerator design. We observe that while computations which manifest as matrix multiplication dominate BERT's overall runtime, as in many convolutional neural networks, memory-intensive computations also feature prominently. We characterize these computations, which have received little attention so far. Further, we also identify heterogeneity in compute-intensive BERT computations and discuss software and possible hardware mechanisms to further optimize these computations. Finally, we discuss implications of these behaviors as networks get larger and use distributed training environments, and how techniques such as micro-batching and mixed-precision training scale. Overall, our analysis identifies holistic solutions to optimize systems for BERT-like models.

Submitted 13 April, 2021; originally announced April 2021.

ACM Class: C.3; C.4

arXiv:2007.10459 [pdf]

SeqPoint: Identifying Representative Iterations of Sequence-based Neural Networks

Authors: Suchita Pati, Shaizeen Aga, Matthew D. Sinclair, Nuwan Jayasena

Abstract: The ubiquity of deep neural networks (DNNs) continues to rise, making them a crucial application class for hardware optimizations. However, detailed profiling and characterization of DNN training remains difficult as these applications often run for hours to days on real hardware. Prior works exploit the iterative nature of DNNs to profile a few training iterations. While such a strategy is sound for networks like convolutional neural networks (CNNs), where the nature of the computation is largely input independent, we observe in this work that this approach is sub-optimal for sequence-based neural networks (SQNNs) such as recurrent neural networks (RNNs). The amount and nature of computations in SQNNs can vary for each input, resulting in heterogeneity across iterations. Thus, arbitrarily selecting a few iterations is insufficient to accurately summarize the behavior of the entire training run. To tackle this challenge, we carefully study the factors that impact SQNN training iterations and identify input sequence length as the key determining factor for variations across iterations. We then use this observation to characterize all iterations of an SQNN training run (requiring no profiling or simulation of the application) and select representative iterations, which we term SeqPoints. We analyze two state-of-the-art SQNNs, DeepSpeech2 and Google's Neural Machine Translation (GNMT), and show that SeqPoints can represent their entire training runs accurately, resulting in geomean errors of only 0.11% and 0.53%, respectively, when projecting overall runtime and 0.13% and 1.50% when projecting speedups due to architectural changes. This high accuracy is achieved while reducing the time needed for profiling by 345x and 214x for the two networks compared to full training runs. As a result, SeqPoint can enable analysis of SQNN training runs in mere minutes instead of hours or days.

Submitted 20 July, 2020; originally announced July 2020.

Comments: To appear in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2020)

ACM Class: C.4

Showing 1–9 of 9 results for author: Aga, S