Showing 1–19 of 19 results for author: Subramoney, S

  1. arXiv:2406.18786

    cs.AR

    Constable: Improving Performance and Power Efficiency by Safely Eliminating Load Instruction Execution

    Authors: Rahul Bera, Adithya Ranganathan, Joydeep Rakshit, Sujit Mahto, Anant V. Nori, Jayesh Gaur, Ataberk Olgun, Konstantinos Kanellopoulos, Mohammad Sadrosadati, Sreenivas Subramoney, Onur Mutlu

    Abstract: Load instructions often limit instruction-level parallelism (ILP) in modern processors due to data and resource dependences they cause. Prior techniques like Load Value Prediction (LVP) and Memory Renaming (MRN) mitigate load data dependence by predicting the data value of a load instruction. However, they fail to mitigate load resource dependence as the predicted load instruction gets executed no…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: To appear in the proceedings of 51st International Symposium on Computer Architecture (ISCA)

  2. arXiv:2406.10247

    cs.CL cs.AI

    QCQA: Quality and Capacity-aware grouped Query Attention

    Authors: Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney

    Abstract: Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of correspo…

    Submitted 8 June, 2024; originally announced June 2024.

  3. arXiv:2404.13886

    cs.OS cs.ET

    Taming Server Memory TCO with Multiple Software-Defined Compressed Tiers

    Authors: Sandeep Kumar, Aravinda Prasad, Sreenivas Subramoney

    Abstract: Memory accounts for 33-50% of the total cost of ownership (TCO) in modern data centers. We propose a novel solution to tame memory TCO through the novel creation and judicious management of multiple software-defined compressed memory tiers. As opposed to the state-of-the-art solutions that employ a 2-Tier solution, a single compressed tier along with DRAM, we define multiple compressed tiers i…

    Submitted 22 April, 2024; originally announced April 2024.

  4. arXiv:2402.11780

    cs.AR cs.AI

    CiMNet: Towards Joint Optimization for DNN Architecture and Configuration for Compute-In-Memory Hardware

    Authors: Souvik Kundu, Anthony Sarah, Vinay Joshi, Om J Omer, Sreenivas Subramoney

    Abstract: With the recent growth in demand for large-scale deep neural networks, compute in-memory (CiM) has come up as a prominent solution to alleviate bandwidth and on-chip interconnect bottlenecks that constrain von Neumann architectures. However, the construction of CiM hardware poses a challenge as any specific memory hierarchy in terms of cache sizes and memory bandwidth at different interfaces may no…

    Submitted 18 March, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

    Comments: 6 pages, 4 figures, 5 tables; Accepted as a full paper by the tinyML Research Symposium 2024

  5. arXiv:2311.10275

    cs.OS cs.AR cs.DB cs.DC

    Telescope: Telemetry at Terabyte Scale

    Authors: Alan Nair, Sandeep Kumar, Aravinda Prasad, Andy Rudoff, Sreenivas Subramoney

    Abstract: Data-hungry applications that require terabytes of memory have become widespread in recent years. To meet the memory needs of these applications, data centers are embracing tiered memory architectures with near and far memory tiers. Precise, efficient, and timely identification of hot and cold data and their placement in appropriate tiers is critical for performance in such systems. Unfortunately,…

    Submitted 29 November, 2023; v1 submitted 16 November, 2023; originally announced November 2023.

  6. arXiv:2310.03370

    cs.OS

    Motivating Next-Generation OS Physical Memory Management for Terabyte-Scale NVMMs

    Authors: Shivank Garg, Aravinda Prasad, Debadatta Mishra, Sreenivas Subramoney

    Abstract: Software managed byte-addressable hybrid memory systems consisting of DRAMs and NVMMs offer a lot of flexibility to design efficient large scale data processing applications. Operating systems (OS) play an important role in enabling the applications to realize the integrated benefits of DRAMs' low access latency and NVMMs' large capacity along with its persistent characteristics. In this paper, we…

    Submitted 5 October, 2023; originally announced October 2023.

    Comments: 14 pages, 24 figures, 2 tables

    ACM Class: D.4.8

  7. arXiv:2304.07941

    cs.DC cs.LG

    Reclaimer: A Reinforcement Learning Approach to Dynamic Resource Allocation for Cloud Microservices

    Authors: Quintin Fettes, Avinash Karanth, Razvan Bunescu, Brandon Beckwith, Sreenivas Subramoney

    Abstract: Many cloud applications are migrated from the monolithic model to a microservices framework in which hundreds of loosely-coupled microservices run concurrently, with significant benefits in terms of scalability, rapid development, modularity, and isolation. However, dependencies among microservices with uneven execution time may result in longer queues, idle resources, or Quality-of-Service (QoS)…

    Submitted 16 April, 2023; originally announced April 2023.

  8. arXiv:2302.08687

    cs.AR cs.AI cs.LG

    VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs

    Authors: Geonhwa Jeong, Sana Damani, Abhimanyu Rajeshkumar Bambhaniya, Eric Qin, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna

    Abstract: Deep Learning (DL) acceleration support in CPUs has recently gained a lot of traction, with several companies (Arm, Intel, IBM) announcing products with specialized matrix engines accessible via GEMM instructions. CPUs are pervasive and need to handle diverse requirements across DL workloads running in edge/HPC/cloud platforms. Therefore, as DL workloads embrace sparsity to reduce the computations…

    Submitted 23 February, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: This paper is accepted to HPCA 2023

  9. arXiv:2207.09765

    cs.AR cs.AI cs.LG q-bio.GN q-bio.QM

    ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis

    Authors: Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie Kim, Taha Shahroodi, Meryem Banu Cavlak, Joel Lindegger, Mohammed Alser, Juan Gómez Luna, Sreenivas Subramoney, Onur Mutlu

    Abstract: Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highl…

    Submitted 21 October, 2023; v1 submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted to ACM TACO

  10. SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

    Authors: Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Zülal Bingöl, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi, Gagandeep Singh, Juan Gómez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, Onur Mutlu

    Abstract: A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in…

    Submitted 31 May, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: To appear in ISCA'22

  11. arXiv:2111.08434

    cs.CV

    Robust 3D Scene Segmentation through Hierarchical and Learnable Part-Fusion

    Authors: Anirud Thyagharajan, Benjamin Ummenhofer, Prashant Laddha, Om J Omer, Sreenivas Subramoney

    Abstract: 3D semantic segmentation is a fundamental building block for several scene understanding applications such as autonomous driving, robotics and AR/VR. Several state-of-the-art semantic segmentation models suffer from the part misclassification problem, wherein parts of the same object are labelled incorrectly. Previous methods have utilized hierarchical, iterative methods to fuse semantic and insta…

    Submitted 16 November, 2021; originally announced November 2021.

  12. arXiv:2110.01752

    cs.AR cs.AI cs.LG

    RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU

    Authors: Geonhwa Jeong, Eric Qin, Ananda Samajdar, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna

    Abstract: As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency. Systolic arrays have been the premier architectural choice as matrix engines in offload accelerators. However, we demonstrate that incorporating them inside CPUs can introduce under-utilization and stalls due to limited register storage to amortize the fill and…

    Submitted 4 October, 2021; originally announced October 2021.

    Comments: This paper is accepted to DAC 2021

  13. Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning

    Authors: Rahul Bera, Konstantinos Kanellopoulos, Anant V. Nori, Taha Shahroodi, Sreenivas Subramoney, Onur Mutlu

    Abstract: Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as…

    Submitted 6 April, 2023; v1 submitted 24 September, 2021; originally announced September 2021.

    ACM Class: C.1.2

  14. arXiv:2103.10779

    cs.DC cs.OS cs.PF

    Page Table Management for Heterogeneous Memory Systems

    Authors: Sandeep Kumar, Aravinda Prasad, Smruti R. Sarangi, Sreenivas Subramoney

    Abstract: Modern enterprise servers are increasingly embracing tiered memory systems with a combination of low latency DRAMs and large capacity but high latency non-volatile main memories (NVMMs) such as Intel's Optane DC PMM. Prior works have focused on efficient placement and migration of data on a tiered memory system, but have not studied the optimal placement of page tables. Explicit and efficient pl…

    Submitted 16 March, 2021; originally announced March 2021.

  15. arXiv:2011.12669

    cs.AR

    AccSS3D: Accelerator for Spatially Sparse 3D DNNs

    Authors: Om Ji Omer, Prashant Laddha, Gurpreet S Kalsi, Anirud Thyagharajan, Kamlesh R Pillai, Abhimanyu Kulkarni, Anbang Yao, Yurong Chen, Sreenivas Subramoney

    Abstract: Semantic understanding and completion of real world scenes is a foundational primitive of 3D Visual perception widely used in high-level applications such as robotics, medical imaging, autonomous driving and navigation. Due to the curse of dimensionality, compute and memory requirements for 3D scene understanding grow in cubic complexity with voxel resolution, posing a huge impediment to realizing…

    Submitted 25 November, 2020; originally announced November 2020.

  16. arXiv:2011.11695

    cs.AR

    Proximu$: Efficiently Scaling DNN Inference in Multi-core CPUs through Near-Cache Compute

    Authors: Anant V. Nori, Rahul Bera, Shankar Balachandran, Joydeep Rakshit, Om J. Omer, Avishaii Abuhatzera, Belliappa Kuttanna, Sreenivas Subramoney

    Abstract: Deep Neural Network (DNN) inference is emerging as the fundamental bedrock for a multitude of utilities and services. CPUs continue to scale up their raw compute capabilities for DNN inference along with mature high performance libraries to extract optimal performance. While general purpose CPUs offer unique attractive advantages for DNN inference at both datacenter and edge, they have primarily e…

    Submitted 2 December, 2020; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: 18 pages, 21 figures

  17. arXiv:2009.07692

    cs.AR q-bio.GN

    GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

    Authors: Damla Senol Cali, Gurpreet S. Kalsi, Zülal Bingöl, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, Onur Mutlu

    Abstract: Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major co…

    Submitted 16 September, 2020; originally announced September 2020.

    Comments: To appear in MICRO 2020

  18. DSPatch: Dual Spatial Pattern Prefetcher

    Authors: Rahul Bera, Anant V. Nori, Onur Mutlu, Sreenivas Subramoney

    Abstract: High main memory latency continues to limit performance of modern high-performance out-of-order cores. While DRAM latency has remained nearly the same over many generations, DRAM bandwidth has grown significantly due to higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked memory packaging (HBM). Current state-of-the-art prefetchers do not do well in extracting higher perfor…

    Submitted 7 October, 2019; originally announced October 2019.

    Comments: This work is to appear in MICRO 2019

  19. arXiv:1808.03518

    cs.PF

    MARS: Memory Aware Reordered Source

    Authors: Ishwar Bhati, Udit Dhawan, Jayesh Gaur, Sreenivas Subramoney, Hong Wang

    Abstract: Memory bandwidth is critical in today's high performance computing systems. The bandwidth is particularly paramount for GPU workloads such as 3D Gaming, Imaging and Perceptual Computing, GPGPU due to their data-intensive nature. As the number of threads and data streams in the GPUs increases with each generation, along with a high available memory bandwidth, memory efficiency is also crucial in or…

    Submitted 1 August, 2018; originally announced August 2018.