Search | arXiv e-print repository

Efficient Sparse Processing-in-Memory Architecture (ESPIM) for Machine Learning Inference

Authors: Mingxuan He, Mithuna Thottethodi, T. N. Vijaykumar

Abstract: Emerging machine learning (ML) models (e.g., transformers) involve memory pin bandwidth-bound matrix-vector (MV) computation in inference. By avoiding pin crossings, processing in memory (PIM) can improve performance and energy for pin-bound workloads, as evidenced by recent commercial efforts in (digital) PIM. Sparse models can improve performance and energy of inference without losing much accuracy. However, unstructured sparse inference injects the key challenges of uncertainty, irregularity, and load imbalance into a dense PIM's operation across all the banks. The dense PIM reads the matrix cells from each bank and broadcasts the vector elements to all the banks exploiting DRAM organization. To address these challenges efficiently, we propose ESPIM which makes four contributions: (1) Because matrix sparsity increases the vector broadcast bandwidth demand per matrix column-read, ESPIM employs a fine-grained interleaving of the matrix cells so that each vector broadcast is shared among multiple rows in each bank, cutting the bandwidth demand. (2) ESPIM mostly avoids on-chip control's area and energy despite sparsity's uncertainties by exploiting the observation that the sparsity is data-dependent but static and known before inference. Accordingly, ESPIM employs static data-dependent scheduling (SDDS). (3) ESPIM decouples the matrix cell values and their indices, placing the indices ahead of the values to enable prefetching of the vector elements. We extend SDDS for performance and correctness with the decoupled prefetching. (4) Finally, we simplify the switch required to select the vector elements that match the matrix cells. We extend SDDS to improve performance by reducing conflicts in the simplified switch. In our simulations, ESPIM achieves 2x average (up to 4.2x) speedup over and 34% average (up to 63%) lower energy than Newton while incurring under 5% area.

Submitted 6 April, 2024; originally announced April 2024.
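
The decoupling of indices and values described in contribution (3) of the ESPIM abstract can be pictured in software as a sparse matrix-vector product in which each row's column indices are consumed first to gather the needed vector elements, and the values are consumed afterwards. The sketch below is only a software analogy under that assumption; the function names and the per-row (indices, values) layout are hypothetical and do not describe ESPIM's hardware, its interleaving, or SDDS.

```python
def sparse_row_dot(col_indices, values, vector):
    """Dot product of one sparse row with a dense vector.

    The indices are consumed first (the gather stands in for the
    prefetch of vector elements); the values are used only afterwards.
    """
    gathered = [vector[c] for c in col_indices]  # index-driven gather
    return sum(v * x for v, x in zip(values, gathered))


def sparse_matvec(rows, vector):
    """rows is a list of (col_indices, values) pairs, one per matrix row."""
    return [sparse_row_dot(idx, val, vector) for idx, val in rows]


# Hypothetical 3x3 matrix with an all-zero middle row.
example_rows = [([0, 2], [1.0, 2.0]), ([], []), ([1], [3.0])]
example_result = sparse_matvec(example_rows, [1.0, 2.0, 3.0])
```

Because the gather depends only on the indices, it could in principle be issued before the values arrive, which is the property the abstract relies on for prefetching.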

arXiv:2404.03113 [pdf, other]

QED: Scalable Verification of Hardware Memory Consistency

Authors: Gokulan Ravi, Xiaokang Qiu, Mithuna Thottethodi, T. N. Vijaykumar

Abstract: Memory consistency model (MCM) issues in out-of-order-issue microprocessor-based shared-memory systems are notoriously non-intuitive and a source of hardware design bugs. Prior hardware verification work is limited to in-order-issue processors, to proving the correctness only of some test cases, or to bounded verification that does not scale in practice beyond 7 instructions across all threads. Because cache coherence (i.e., write serialization and atomicity) and pipeline front-end verification and testing are well-studied, we focus on the memory ordering in an out-of-order-issue processor's load-store queue and the coherence interface between the core and global coherence. We propose QED based on the key notion of observability that any hardware reordering matters only if a forbidden value is produced. We argue that one needs to consider (1) only directly-ordered instruction pairs -- transitively non-redundant pairs connected by an edge in the MCM-imposed partial order -- and not all in-flight instructions, and (2) only the ordering of external events from other cores (e.g., invalidations) but not the events' originating cores, achieving verification scalability in both the numbers of in-flight memory instructions and of cores. Exhaustively considering all pairs of instruction types and all types of external events intervening between each pair, QED attempts to restore any reordered instructions to an MCM-compliant order without changing the execution values, where failure indicates an MCM violation. Each instruction pair's exploration results in a decision tree of simple, narrowly-defined predicates to be evaluated against the RTL. In our experiments, we automatically generate the decision trees for SC, TSO, and RISC-V WMO, and illustrate automatable verification by evaluating a substantial predicate against the BOOMv3 implementation of RISC-V WMO, leaving full automation to future work.

Submitted 3 April, 2024; originally announced April 2024.

Comments: 13 pages, 8 figures
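
The "directly-ordered" pairs in the QED abstract are the transitively non-redundant edges of a partial order, which is what graph theory calls a transitive reduction. The following is a minimal sketch of that graph idea only, assuming the ordering is given as an acyclic set of (earlier, later) pairs; the function name and the input encoding are hypothetical and are not taken from the paper.

```python
def directly_ordered_pairs(order_edges):
    """Return the edges of an acyclic order that are not implied by others.

    An edge (a, b) is redundant if b is still reachable from a after
    that edge itself is ignored.
    """
    successors = {}
    for a, b in order_edges:
        successors.setdefault(a, set()).add(b)
        successors.setdefault(b, set())

    def reachable_without(src, dst, skipped_edge):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            for nxt in successors[node]:
                if (node, nxt) == skipped_edge:
                    continue
                if nxt == dst:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return {(a, b) for a, b in order_edges
            if not reachable_without(a, b, (a, b))}


# Three instructions ordered i1 < i2 < i3; the pair (i1, i3) is implied.
example_pairs = directly_ordered_pairs({("i1", "i2"), ("i2", "i3"), ("i1", "i3")})
```

In this toy case only two of the three pairs remain, which illustrates why restricting attention to direct pairs reduces the number of cases to examine.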

arXiv:2306.07785 [pdf, other]

SafeBet: Secure, Simple, and Fast Speculative Execution

Authors: Conor Green, Cole Nelson, Mithuna Thottethodi, T. N. Vijaykumar

Abstract: Spectre attacks exploit microprocessor speculative execution to read and transmit forbidden data outside the attacker's trust domain and sandbox. Recent hardware schemes allow potentially-unsafe speculative accesses but prevent the secret's transmission by delaying most access-dependent instructions even in the predominantly-common, no-attack case, which incurs performance loss and hardware complexity. Instead, we propose SafeBet which allows only, and does not delay most, safe accesses, achieving both security and high performance. SafeBet is based on the key observation that speculatively accessing a destination location is safe if the location's access by the same static trust domain has been committed previously; and potentially unsafe, otherwise. We extend this observation to handle inter trust-domain code and data interactions. SafeBet employs the Speculative Memory Access Control Table (SMACT) to track non-speculative trust domain code region-destination pairs. Disallowed accesses wait until reaching commit to trigger well-known replay, with virtually no change to the pipeline. Software simulations using SpecCPU benchmarks show that SafeBet uses an 8.3-KB SMACT per core to perform within 6% on average (63% at worst) of the unsafe baseline behind which NDA-restrictive, a previous scheme of security and hardware complexity comparable to SafeBet's, lags by 83% on average.

Submitted 13 June, 2023; originally announced June 2023.
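
The key observation in the SafeBet abstract (a speculative access is treated as safe only if the same code region has already committed an access to that destination) can be illustrated with a simple lookup table. This is a hedged sketch, not the SMACT design: the class name, the use of an unbounded Python set, and the string and integer identifiers for regions and destinations are all assumptions made for illustration, and the real table is a small fixed-size hardware structure.

```python
class CommittedAccessTable:
    """Toy stand-in for a table of committed (code region, destination) pairs."""

    def __init__(self):
        self._committed = set()

    def record_commit(self, code_region, destination):
        # Called when an access reaches commit, i.e. is non-speculative.
        self._committed.add((code_region, destination))

    def may_speculate(self, code_region, destination):
        # A speculative access is allowed only if the same pair was
        # committed before; otherwise it would wait until commit.
        return (code_region, destination) in self._committed


table = CommittedAccessTable()
first_try = table.may_speculate("region_A", 0x1000)   # not yet committed
table.record_commit("region_A", 0x1000)
second_try = table.may_speculate("region_A", 0x1000)  # now allowed
```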

arXiv:2106.14138 [pdf, other]

OCCAM: Optimal Data Reuse for Convolutional Neural Networks

Authors: Ashish Gondimalla, Jianqiao Liu, T. N. Vijaykumar, Mithuna Thottethodi

Abstract: Convolutional neural networks (CNNs) are emerging as powerful tools for image processing in important commercial applications. We focus on the important problem of improving the latency of image recognition. CNNs' large data at each layer's input, filters, and output poses a memory bandwidth problem. While previous work captures only some of the enormous data reuse, full reuse implies that the initial input image and filters are read once from off chip and the final output is written once off chip without spilling the intermediate layers' data to off-chip. We propose Occam to capture full reuse via four contributions. (1) We identify the necessary condition for full reuse. (2) We identify the dependence closure as the sufficient condition to capture full reuse using the least on-chip memory. (3) Because the dependence closure is often too large to fit in on-chip memory, we propose a dynamic programming algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip capacity. Occam's partitions reside on different chips forming a pipeline so that a partition's filters and dependence closure remain on-chip as different images pass through (i.e., each partition incurs off-chip traffic only for its inputs and outputs). (4) Because the optimal partitions may result in an unbalanced pipeline, we propose staggered asynchronous pipelines (STAP) which replicates the bottleneck stages to improve throughput by staggering the mini-batches across the replicas. Importantly, STAP achieves balanced pipelines without changing Occam's optimal partitioning. Our simulations show that Occam cuts off-chip transfers by 21x and achieves 2.06x and 1.36x better performance, and 33% and 24% better energy than the base case and Layer Fusion, respectively. On an FPGA implementation, Occam performs 5.1x better than the base case.

Submitted 26 June, 2021; originally announced June 2021.
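
Contribution (3) of the Occam abstract mentions a dynamic programming algorithm that partitions a CNN to minimize off-chip traffic at partition boundaries under an on-chip capacity. The sketch below shows one generic dynamic program of that flavor, not the paper's algorithm. It assumes a linear chain of layers, an additive per-layer on-chip footprint, and a known traffic cost for cutting between each pair of adjacent layers; all three are simplifying assumptions, and the names `partition_layers`, `footprint`, and `boundary_traffic` are hypothetical.

```python
from math import inf


def partition_layers(footprint, boundary_traffic, capacity):
    """Split a layer chain into contiguous partitions that each fit on chip.

    footprint[i]        : on-chip memory needed by layer i (assumed additive)
    boundary_traffic[k] : off-chip traffic if a cut is placed between
                          layer k and layer k + 1
    Returns (minimum total cut traffic, sorted list of cut positions),
    where a cut position i means a new partition starts at layer i.
    """
    n = len(footprint)
    best = [inf] * (n + 1)   # best[j]: least traffic for layers 0..j-1
    prev = [-1] * (n + 1)    # start index of the last partition
    best[0] = 0
    for j in range(1, n + 1):
        used = 0
        for i in range(j - 1, -1, -1):   # last partition = layers i..j-1
            used += footprint[i]
            if used > capacity:
                break
            cut_cost = boundary_traffic[i - 1] if i > 0 else 0
            if best[i] + cut_cost < best[j]:
                best[j] = best[i] + cut_cost
                prev[j] = i
    if best[n] == inf:
        raise ValueError("some layer does not fit in the given capacity")
    cuts, j = [], n
    while j > 0:
        i = prev[j]
        if i > 0:
            cuts.append(i)
        j = i
    return best[n], sorted(cuts)


# Four equal layers, capacity for two; the cheap boundary is in the middle.
example_cost, example_cuts = partition_layers([4, 4, 4, 4], [10, 1, 10], 8)
```

In the example, the only way to stay within capacity with one cut is to cut at the cheap middle boundary, which is what the dynamic program selects.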

arXiv:2104.08734 [pdf, other]

Barrier-Free Large-Scale Sparse Tensor Accelerator (BARISTA) For Convolutional Neural Networks

Authors: Ashish Gondimalla, Sree Charan Gundabolu, T. N. Vijaykumar, Mithuna Thottethodi

Abstract: Convolutional neural networks (CNNs) are emerging as powerful tools for visual recognition. Recent architecture proposals for sparse CNNs exploit zeros in the feature maps and filters for performance and energy without losing accuracy. Sparse architectures that exploit two-sided sparsity in both feature maps and filters have been studied only at small scales (e.g., 1K multiply-accumulate (MAC) units). However, to realize their advantages in full, the sparse architectures have to be scaled up to levels of the dense architectures (e.g., 32K MACs in the TPU). Such scaling is challenging since achieving reuse through broadcasts incurs implicit barrier cost and raises the inter-related issues of load imbalance, buffering, and on-chip bandwidth demand. SparTen, a previous scheme, addresses one aspect of load balancing but not other aspects, nor the other issues of buffering and bandwidth. To that end, we propose the barrier-free large-scale sparse tensor accelerator (BARISTA). BARISTA (1) is the first architecture for scaling up sparse CNN accelerators; (2) reduces on-chip bandwidth demand by telescoping request-combining the input map requests and snarfing the filter requests; (3) reduces buffering via basic buffer sharing and avoids the ensuing barriers between consecutive input maps by coloring the output buffers; (4) load balances intra-filter work via dynamic round-robin work assignment; and (5) employs hierarchical buffering which achieves high cache bandwidth via a few, wide, shared buffers and low buffering via narrower, private buffers at the compute. Our simulations show that, on average, BARISTA performs 5.4x, 2.2x, 1.7x, and 2.5x better than a dense, a one-sided, a naively-scaled two-sided, and an iso-area two-sided architecture, respectively. Using 45-nm technology, ASIC synthesis of our RTL design for four clusters of 8K MACs at 1 GHz clock speed reports 213 mm^2 area and 170 W power.

Submitted 8 May, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

arXiv:2011.02022 [pdf, other]

Booster: An Accelerator for Gradient Boosting Decision Trees

Authors: Mingxuan He, T. N. Vijaykumar, Mithuna Thottethodi

Abstract: We propose Booster, a novel accelerator for gradient boosting trees based on the unique characteristics of gradient boosting models. We observe that the dominant steps of gradient boosting training (accounting for 90-98% of training time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., accumulate and compare values in the structures). Unfortunately, existing multicores and GPUs are unable to harness this parallelism because they do not support massively-parallel data structure accesses that are irregular and data-dependent. By employing a scalable sea-of-small-SRAMs approach and an SRAM bandwidth-preserving mapping of data record fields to the SRAMs, Booster achieves significantly more parallelism (e.g., 3200-way parallelism) than multicores and GPUs. In addition, Booster employs a redundant data representation that significantly lowers the memory bandwidth demand. Our simulations reveal that Booster achieves 11.4x speedup and 6.4x speedup over an ideal 32-core multicore and an ideal GPU, respectively. Based on ASIC synthesis of FPGA-validated RTL using 45 nm technology, we estimate a Booster chip to occupy 60 mm^2 of area and dissipate 23 W when operating at 1-GHz clock speed.

Submitted 5 November, 2020; v1 submitted 3 November, 2020; originally announced November 2020.

arXiv:1805.11158 [pdf, other]

Dart: Divide and Specialize for Fast Response to Congestion in RDMA-based Datacenter Networks

Authors: Jiachen Xue, Muhammad Usama Chaudhry, Balajee Vamanan, T. N. Vijaykumar, Mithuna Thottethodi

Abstract: Though Remote Direct Memory Access (RDMA) promises to reduce datacenter network latencies significantly compared to TCP (e.g., 10x), end-to-end congestion control in the presence of incasts is a challenge. Targeting the full generality of the congestion problem, previous schemes rely on slow, iterative convergence to the appropriate sending rates (e.g., TIMELY takes 50 RTTs). Several papers have shown that even in oversubscribed datacenter networks most congestion occurs at the receiver. Accordingly, we propose a divide-and-specialize approach, called Dart, which isolates the common case of receiver congestion and further subdivides the remaining in-network congestion into the simpler spatially-localized and the harder spatially-dispersed cases. For receiver congestion, we propose direct apportioning of sending rates (DASR) in which a receiver for n senders directs each sender to cut its rate by a factor of n, converging in only one RTT. For the spatially-localized case, Dart provides fast (under one RTT) response by adding novel switch hardware for in-order flow deflection (IOFD) because RDMA disallows packet reordering on which previous load balancing schemes rely. For the uncommon spatially-dispersed case, Dart falls back to DCQCN. Small-scale testbed measurements and at-scale simulations, respectively, show that Dart achieves 60% (2.5x) and 79% (4.8x) lower 99th-percentile latency, and similar and 58% higher throughput than InfiniBand, and TIMELY and DCQCN.

Submitted 30 December, 2019; v1 submitted 28 May, 2018; originally announced May 2018.

Comments: 15 pages, 14 figures

MSC Class: C.2.2 ACM Class: C.2.2
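
The DASR rule stated in the Dart abstract (a receiver with n senders directs each sender to cut its rate by a factor of n) is simple enough to show as arithmetic. The sketch below assumes every sender starts at the same line rate and that rates are plain numbers in arbitrary units; the function name is hypothetical, and nothing here models the actual RDMA signalling.

```python
def dasr_rates(current_rates):
    """Apply the 'cut by a factor of n' rule for n concurrent senders."""
    n = len(current_rates)
    if n == 0:
        return []
    return [rate / n for rate in current_rates]


# Four senders, each initially sending at a line rate of 40 units.
new_rates = dasr_rates([40.0, 40.0, 40.0, 40.0])
aggregate = sum(new_rates)
```

With four senders at an equal line rate of 40, each drops to 10 and the aggregate arriving at the receiver equals one line rate again, which is consistent with the one-RTT convergence the abstract describes for this case.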

arXiv:1609.07192 [pdf, other]

Hydra: Leveraging Functional Slicing for Efficient Distributed SDN Controllers

Authors: Yiyang Chang, Ashkan Rezaei, Balajee Vamanan, Jahangir Hasan, Sanjay Rao, T. N. Vijaykumar

Abstract: The conventional approach to scaling Software Defined Networking (SDN) controllers today is to partition switches based on network topology, with each partition being controlled by a single physical controller, running all SDN applications. However, topological partitioning is limited by the fact that (i) performance of latency-sensitive (e.g., monitoring) SDN applications associated with a given partition may be impacted by co-located compute-intensive (e.g., route computation) applications; (ii) simultaneously achieving low convergence time and response times might be challenging; and (iii) communication between instances of an application across partitions may increase latencies. To tackle these issues, in this paper, we explore functional slicing, a complementary approach to scaling, where multiple SDN applications belonging to the same topological partition may be placed in physically distinct servers. We present Hydra, a framework for distributed SDN controllers based on functional slicing. Hydra chooses partitions based on convergence time as the primary metric, but places application instances across partitions in a manner that keeps response times low while considering communication between applications of a partition, and instances of an application across partitions. Evaluations using the Floodlight controller show the importance and effectiveness of Hydra in simultaneously keeping convergence times on failures small, while sustaining higher throughput per partition and ensuring responsiveness to latency-sensitive applications.

Submitted 22 September, 2016; originally announced September 2016.

Comments: 8 pages

arXiv:1504.04297 [pdf]

MigrantStore: Leveraging Virtual Memory in DRAM-PCM Memory Architecture

Authors: Hamza Bin Sohail, Balajee Vamanan, T. N. Vijaykumar

Abstract: With the imminent slowing down of DRAM scaling, Phase Change Memory (PCM) is emerging as a lead alternative for main memory technology. While PCM achieves low energy due to various technology-specific advantages, PCM is significantly slower than DRAM (especially for writes) and can endure far fewer writes before wearing out. Previous work has proposed to use a large, DRAM-based hardware cache to absorb writes and provide faster access. However, due to ineffectual caching where blocks are evicted before a sufficient number of accesses, hardware caches incur significant overheads in energy and bandwidth, two key but scarce resources in modern multicores. Because using hardware for detecting and removing such ineffectual caching would incur additional hardware cost and complexity, we leverage the OS virtual memory support for this purpose. We propose a DRAM-PCM hybrid memory architecture where the OS migrates pages on demand from the PCM to DRAM. We call the DRAM part of our memory MigrantStore, which includes two ideas. First, to reduce the energy, bandwidth, and wear overhead of ineffectual migrations, we propose migration hysteresis. Second, to reduce the software overhead of good replacement policies, we propose the recently-accessed-page-id (RAPid) buffer, a hardware buffer to track the addresses of recently-accessed MigrantStore pages.

Submitted 16 April, 2015; originally announced April 2015.

arXiv:1503.05338 [pdf]

TimeTrader: Exploiting Latency Tail to Save Datacenter Energy for On-line Data-Intensive Applications

Authors: Balajee Vamanan, Hamza Bin Sohail, Jahangir Hasan, T. N. Vijaykumar

Abstract: Datacenters running on-line, data-intensive applications (OLDIs) consume significant amounts of energy. However, reducing their energy is challenging due to their tight response time requirements. A key aspect of OLDIs is that each user query goes to all or many of the nodes in the cluster, so that the overall time budget is dictated by the tail of the replies' latency distribution; replies see latency variations both in the network and compute. Previous work proposes to achieve load-proportional energy by slowing down the computation at lower datacenter loads based directly on response times (i.e., at lower loads, the proposal exploits the average slack in the time budget provisioned for the peak load). In contrast, we propose TimeTrader to reduce energy by exploiting the latency slack in the sub-critical replies which arrive before the deadline (e.g., 80% of replies are 3-4x faster than the tail). This slack is present at all loads and subsumes the previous work's load-related slack. While the previous work shifts the leaves' response time distribution to consume the slack at lower loads, TimeTrader reshapes the distribution at all loads by slowing down individual sub-critical nodes without increasing missed deadlines. TimeTrader exploits slack in both the network and compute budgets. Further, TimeTrader leverages Earliest Deadline First scheduling to largely decouple critical requests from the queuing delays of sub-critical requests which can then be slowed down without hurting critical requests. A combination of real-system measurements and at-scale simulations shows that without adding to missed deadlines, TimeTrader saves 15-19% and 41-49% energy at 90% and 30% loading, respectively, in a datacenter with 512 nodes, whereas previous work saves 0% and 31-37%.

Submitted 18 March, 2015; originally announced March 2015.

Comments: 13 pages
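
The TimeTrader abstract relies on Earliest Deadline First (EDF) scheduling so that critical requests are not stuck behind sub-critical ones. The following is a minimal, generic EDF queue built on a binary heap, offered as a sketch of the scheduling discipline only. The class name, the numeric deadlines, and the request labels are hypothetical, and the sketch does not model TimeTrader's slack estimation or slowdown mechanisms.

```python
import heapq
import itertools


class EdfQueue:
    """Serve the pending request with the earliest deadline first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal deadlines stay FIFO.
        self._counter = itertools.count()

    def push(self, deadline, request):
        heapq.heappush(self._heap, (deadline, next(self._counter), request))

    def pop(self):
        deadline, _, request = heapq.heappop(self._heap)
        return deadline, request

    def __len__(self):
        return len(self._heap)


queue = EdfQueue()
queue.push(30, "sub-critical-a")
queue.push(10, "critical")
queue.push(30, "sub-critical-b")
service_order = [queue.pop()[1] for _ in range(len(queue))]
```

Even though the critical request arrives second, it is served first, so its queuing delay does not depend on how slowly the sub-critical requests ahead of it are processed.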

Showing 1–10 of 10 results for author: Vijaykumar, T N