Search | arXiv e-print repository

TCAM-SSD: A Framework for Search-Based Computing in Solid-State Drives

Authors: Ryan Wong, Nikita Kim, Kevin Higgs, Sapan Agarwal, Engin Ipek, Saugata Ghose, Ben Feinberg

Abstract: As the amount of data produced in society continues to grow at an exponential rate, modern applications are incurring significant performance and energy penalties due to high data movement between the CPU and memory/storage. While processing in main memory can alleviate these penalties, it is becoming increasingly difficult to keep large datasets entirely in main memory. This has led to a recent push for in-storage computation, where processing is performed inside the storage device. We propose TCAM-SSD, a new framework for search-based computation inside the NAND flash memory arrays of a conventional solid-state drive (SSD), which requires lightweight modifications to only the array periphery and firmware. TCAM-SSD introduces a search manager and link table, which can logically partition the NAND flash memory's contents into search-enabled regions and standard storage regions. Together, these light firmware changes enable TCAM-SSD to seamlessly handle block I/O operations, in addition to new search operations, thereby reducing end-to-end execution time and total data movement. We provide an NVMe-compatible interface that provides programmers with the ability to dynamically allocate data on and make use of TCAM-SSD, allowing the system to be leveraged by a wide variety of applications. We evaluate three example use cases of TCAM-SSD to demonstrate its benefits. For transactional databases, TCAM-SSD can mitigate the performance penalties for applications with large datasets, achieving a 60.9% speedup over a conventional system that retrieves data from the SSD and computes using the CPU. For database analytics, TCAM-SSD provides an average speedup of 17.7x over a conventional system for a collection of analytical queries. For graph analytics, we combine TCAM-SSD's associative search with a sparse data structure, speeding up graph computing for larger-than-memory datasets by 14.5%.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2305.13130

Scaling Serverless Functions in Edge Networks: A Reinforcement Learning Approach

Authors: Mounir Bensalem, Erkan Ipek, Admela Jukan

Abstract: With rapid advances in containerization techniques, the serverless computing model is becoming a valid candidate execution model in edge networking, similar to the widely used cloud model for applications that are stateless, single-purpose, and event-driven, and in particular for delay-sensitive applications. One of the cloud serverless processes, the auto-scaling mechanism, cannot, however, be directly applied at the edge, due to the distributed nature of edge nodes, the difficulty of optimal resource allocation, and the delay sensitivity of workloads. We propose a solution to the auto-scaling problem by applying a reinforcement learning (RL) approach to the efficient scaling and resource allocation of serverless functions in edge networks. We compare RL and Deep RL algorithms with empirical, monitoring-based heuristics, considering delay-sensitive applications. The simulation results show that the RL algorithm outperforms the standard, monitoring-based algorithms in terms of total delay of function requests, improving delay performance by up to 50%.

Submitted 22 May, 2023; originally announced May 2023.

Comments: This paper is uploaded here for the research community and is thus for non-commercial purposes

arXiv:1707.09952

doi 10.1109/JETCAS.2018.2796379

Multiscale Co-Design Analysis of Energy, Latency, Area, and Accuracy of a ReRAM Analog Neural Training Accelerator

Authors: Matthew J. Marinella, Sapan Agarwal, Alexander Hsia, Isaac Richter, Robin Jacobs-Gedrim, John Niroula, Steven J. Plimpton, Engin Ipek, Conrad D. James

Abstract: Neural networks are an increasingly attractive algorithm for natural language processing and pattern recognition. Deep networks with >50M parameters are made possible by modern GPU clusters operating at <50 pJ per operation and, more recently, production accelerators capable of <5 pJ per operation at the board level. However, with the slowing of CMOS scaling, new paradigms will be required to achieve the next several orders of magnitude in performance-per-watt gains. Using an analog resistive memory (ReRAM) crossbar to perform key matrix operations in an accelerator is an attractive option. This work presents a detailed design, using a state-of-the-art 14/16 nm PDK, of an analog crossbar circuit block designed to process three key kernels required in training and inference of neural networks. A detailed circuit and device-level analysis of energy, latency, area, and accuracy is given and compared to relevant designs using standard digital ReRAM and SRAM operations. It is shown that the analog accelerator has a 270x energy and 540x latency advantage over a similar block utilizing only digital ReRAM and takes only 11 fJ per multiply and accumulate (MAC). Compared to an SRAM-based accelerator, the energy is 430x better and the latency is 34x better. Although training accuracy is degraded in the analog accelerator, several options to improve this are presented. The possible gains over a similar digital-only version of this accelerator block suggest that continued optimization of analog resistive memories is valuable. This detailed circuit and device analysis of a training accelerator may serve as a foundation for further architecture-level studies.

Submitted 16 February, 2018; v1 submitted 31 July, 2017; originally announced July 2017.

Showing 1–3 of 3 results for author: Ipek, E