Showing 1–50 of 54 results for author: Gómez-Luna, J

  1. arXiv:2405.06081  [pdf, other]

    cs.AR cs.DC

    Simultaneous Many-Row Activation in Off-the-Shelf DRAM Chips: Experimental Characterization and Analysis

    Authors: Ismail Emir Yuksel, Yahya Can Tugrul, F. Nisa Bostanci, Geraldo F. Oliveira, A. Giray Yaglikci, Ataberk Olgun, Melina Soysal, Haocong Luo, Juan Gómez-Luna, Mohammad Sadrosadati, Onur Mutlu

    Abstract: We experimentally analyze the computational capability of commercial off-the-shelf (COTS) DRAM chips and the robustness of these capabilities under various timing delays between DRAM commands, data patterns, temperature, and voltage levels. We extensively characterize 120 COTS DDR4 chips from two major manufacturers. We highlight four key results of our study. First, COTS DRAM chips are capable of…

    Submitted 9 May, 2024; originally announced May 2024.

    Comments: To appear in DSN 2024

  2. arXiv:2405.03967  [pdf, other]

    cs.LG cs.AI cs.AR

    SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems

    Authors: Kailash Gogineni, Sai Santosh Dayapule, Juan Gómez-Luna, Karthikeya Gogineni, Peng Wei, Tian Lan, Mohammad Sadrosadati, Onur Mutlu, Guru Venkataramani

    Abstract: Reinforcement Learning (RL) trains agents to learn optimal behavior by maximizing reward signals from experience datasets. However, RL training often faces memory limitations, leading to execution latencies and prolonged training times. To overcome this, SwiftRL explores Processing-In-Memory (PIM) architectures to accelerate RL workloads. We achieve near-linear performance scaling by implementing…

    Submitted 6 May, 2024; originally announced May 2024.

  3. arXiv:2404.07164  [pdf, other]

    cs.AR cs.AI cs.DC cs.LG

    Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory System

    Authors: Steve Rhyner, Haocong Luo, Juan Gómez-Luna, Mohammad Sadrosadati, Jiawei Jiang, Ataberk Olgun, Harshita Gupta, Ce Zhang, Onur Mutlu

    Abstract: Machine Learning (ML) training on large-scale datasets is a very expensive and time-consuming workload. Processor-centric architectures (e.g., CPU, GPU) commonly used for modern ML training workloads are limited by the data movement bottleneck, i.e., due to repeatedly accessing the training dataset. As a result, processor-centric systems suffer from performance degradation and high energy consumpt…

    Submitted 10 April, 2024; originally announced April 2024.

  4. arXiv:2403.04539  [pdf, other]

    cs.AR

    PUMA: Efficient and Low-Cost Memory Allocation and Alignment Support for Processing-Using-Memory Architectures

    Authors: Geraldo F. Oliveira, Emanuele G. Esposito, Juan Gómez-Luna, Onur Mutlu

    Abstract: Processing-using-DRAM (PUD) architectures impose a restrictive data layout and alignment for their operands, where source and destination operands (i) must reside in the same DRAM subarray (i.e., a group of DRAM rows sharing the same row buffer and row decoder) and (ii) are aligned to the boundaries of a DRAM row. However, standard memory allocation routines (i.e., malloc, posix_memalign, and huge…

    Submitted 7 March, 2024; originally announced March 2024.

  5. arXiv:2402.19080  [pdf, other]

    cs.AR cs.DC

    MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing

    Authors: Geraldo F. Oliveira, Ataberk Olgun, Abdullah Giray Yağlıkçı, F. Nisa Bostancı, Juan Gómez-Luna, Saugata Ghose, Onur Mutlu

    Abstract: Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a DRAM array's massive internal parallelism to execute very-wide data-parallel operations, in a single-instruction multiple-data (SIMD) fashion. However, DRAM rows' large and rigid granularity limits the effectiveness and applicability of PUD in three ways. First, since applications have varying degrees of SIMD paralleli…

    Submitted 3 March, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: Extended version of HPCA 2024 paper. arXiv admin note: text overlap with arXiv:2109.05881 by other authors

  6. arXiv:2402.18736  [pdf, other]

    cs.AR cs.DC

    Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis

    Authors: Ismail Emir Yuksel, Yahya Can Tugrul, Ataberk Olgun, F. Nisa Bostanci, A. Giray Yaglikci, Geraldo F. Oliveira, Haocong Luo, Juan Gómez-Luna, Mohammad Sadrosadati, Onur Mutlu

    Abstract: Processing-using-DRAM (PuD) is an emerging paradigm that leverages the analog operational properties of DRAM circuitry to enable massively parallel in-DRAM computation. PuD has the potential to reduce or eliminate costly data movement between processing elements and main memory. Prior works experimentally demonstrate three-input MAJ (MAJ3) and two-input AND and OR operations in commercial off-the-…

    Submitted 21 April, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: A shorter version of this work is to appear at the 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA-30), 2024

  7. arXiv:2310.10168  [pdf, other]

    cs.AR

    DaPPA: A Data-Parallel Framework for Processing-in-Memory Architectures

    Authors: Geraldo F. Oliveira, Alain Kohli, David Novo, Juan Gómez-Luna, Onur Mutlu

    Abstract: To ease the programmability of PIM architectures, we propose DaPPA (data-parallel processing-in-memory architecture), a framework that can, for a given application, automatically distribute input and gather output data, handle memory management, and parallelize work across the DPUs. The key idea behind DaPPA is to remove the responsibility of managing hardware resources from the programmer by provi…

    Submitted 16 October, 2023; originally announced October 2023.

  8. arXiv:2310.01893  [pdf, other]

    cs.AR cs.DC cs.SE

    SimplePIM: A Software Framework for Productive and Efficient Processing-in-Memory

    Authors: Jinfan Chen, Juan Gómez-Luna, Izzat El Hajj, Yuxin Guo, Onur Mutlu

    Abstract: Data movement between memory and processors is a major bottleneck in modern computing systems. The processing-in-memory (PIM) paradigm aims to alleviate this bottleneck by performing computation inside memory chips. Real PIM hardware (e.g., the UPMEM system) is now available and has demonstrated potential in many applications. However, programming such real PIM hardware remains a challenge for man…

    Submitted 3 October, 2023; originally announced October 2023.

  9. arXiv:2309.06545  [pdf, other]

    cs.CR cs.AR

    Evaluating Homomorphic Operations on a Real-World Processing-In-Memory System

    Authors: Harshita Gupta, Mayank Kabra, Juan Gómez-Luna, Konstantinos Kanellopoulos, Onur Mutlu

    Abstract: Computing on encrypted data is a promising approach to reduce data security and privacy risks, with homomorphic encryption serving as a facilitator in achieving this goal. In this work, we accelerate homomorphic operations using the Processing-in-Memory (PIM) paradigm to mitigate the large memory capacity and frequent data movement requirements. Using a real-world PIM system, we accelerate the Br…

    Submitted 3 October, 2023; v1 submitted 12 September, 2023; originally announced September 2023.

    Comments: This work will be presented at IISWC 2023

  10. arXiv:2304.01951  [pdf, other]

    cs.MS cs.AR cs.DC cs.LG

    TransPimLib: A Library for Efficient Transcendental Functions on Processing-in-Memory Systems

    Authors: Maurus Item, Juan Gómez-Luna, Yuxin Guo, Geraldo F. Oliveira, Mohammad Sadrosadati, Onur Mutlu

    Abstract: Processing-in-memory (PIM) promises to alleviate the data movement bottleneck in modern computing systems. However, current real-world PIM systems have the inherent disadvantage that their hardware is more constrained than in conventional processors (CPU, GPU), due to the difficulty and cost of building processing elements near or inside the memory. As a result, general-purpose PIM architectures s…

    Submitted 5 September, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

    Comments: Our open-source software is available at https://github.com/CMU-SAFARI/transpimlib

  11. arXiv:2303.03509  [pdf, other]

    cs.AR

    SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation

    Authors: Gagandeep Singh, Alireza Khodamoradi, Kristof Denolf, Jack Lo, Juan Gómez-Luna, Joseph Melber, Andra Bisca, Henk Corporaal, Onur Mutlu

    Abstract: Fast and accurate climate simulations and weather predictions are critical for understanding and preparing for the impact of climate change. Real-world weather and climate modeling consist of complex compound stencil kernels that do not perform well on conventional architectures. Horizontal diffusion is one such important compound stencil found in many climate and weather prediction models. Recent…

    Submitted 9 May, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

  12. arXiv:2211.04369  [pdf, other]

    cs.AR

    Accelerating Time Series Analysis via Processing using Non-Volatile Memories

    Authors: Ivan Fernandez, Christina Giannoula, Aditya Manglik, Ricardo Quislant, Nika Mansouri Ghiasi, Juan Gómez-Luna, Eladio Gutierrez, Oscar Plata, Onur Mutlu

    Abstract: Time Series Analysis (TSA) is a critical workload to extract valuable information from collections of sequential data, e.g., detecting anomalies in electrocardiograms. Subsequence Dynamic Time Warping (sDTW) is the state-of-the-art algorithm for high-accuracy TSA. We find that the performance and energy efficiency of sDTW on conventional CPU and GPU platforms are heavily burdened by the latency an…

    Submitted 16 January, 2024; v1 submitted 8 November, 2022; originally announced November 2022.

  13. arXiv:2209.10914  [pdf, other]

    cs.AR

    Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources

    Authors: Sina Darabi, Mohammad Sadrosadati, Joël Lindegger, Negar Akbarzadeh, Mohammad Hosseini, Jisung Park, Juan Gómez-Luna, Hamid Sarbazi-Azad, Onur Mutlu

    Abstract: Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel applications. In many GPU applications, GPU memory bandwidth bottlenecks performance, causing underutilization of GPU cores. Hence, disabling many cores does not affect the performance of memory-bound workloads. While simply power-gating unused GPU cores would save energy, prior works attempt to better utilize GPU core…

    Submitted 6 April, 2023; v1 submitted 22 September, 2022; originally announced September 2022.

    Comments: To appear in 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022

  14. arXiv:2209.08938  [pdf, other]

    cs.AR cs.DC cs.LG

    Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud

    Authors: Geraldo F. Oliveira, Juan Gómez-Luna, Saugata Ghose, Amirali Boroumand, Onur Mutlu

    Abstract: Neural networks (NNs) are growing in importance and complexity. A neural network's performance (and energy efficiency) can be bound either by computation or memory resources. The processing-in-memory (PIM) paradigm, where computation is placed near or within memory arrays, is a viable solution to accelerate memory-bound NNs. However, PIM architectures vary in form, where different PIM approaches l…

    Submitted 27 March, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

    Comments: This is an extended and updated version of a paper published in IEEE Micro, pp. 1-14, 29 Aug. 2022. arXiv admin note: text overlap with arXiv:2109.14320

  15. arXiv:2209.05566  [pdf, other]

    cs.AR cs.DC

    Flash-Cosmos: In-Flash Bulk Bitwise Operations Using Inherent Computation Capability of NAND Flash Memory

    Authors: Jisung Park, Roknoddin Azizi, Geraldo F. Oliveira, Mohammad Sadrosadati, Rakesh Nadig, David Novo, Juan Gómez-Luna, Myungsuk Kim, Onur Mutlu

    Abstract: Bulk bitwise operations, i.e., bitwise operations on large bit vectors, are prevalent in a wide range of important application domains, including databases, graph processing, genome analysis, cryptography, and hyper-dimensional computing. In conventional systems, the performance and energy efficiency of bulk bitwise operations are bottlenecked by data movement between the compute units and the mem…

    Submitted 12 September, 2022; originally announced September 2022.

    Comments: To appear in 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022

  16. arXiv:2208.10606  [pdf, other]

    cs.AR cs.AI cs.LG

    LEAPER: Fast and Accurate FPGA-based System Performance Prediction via Transfer Learning

    Authors: Gagandeep Singh, Dionysios Diamantopoulos, Juan Gómez-Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu

    Abstract: Machine learning has recently gained traction as a way to overcome the slow accelerator generation and implementation process on an FPGA. It can be used to build performance and resource usage models that enable fast early-stage design space exploration. First, training requires large amounts of data (features extracted from design synthesis and implementation tools), which is cost-inefficient bec…

    Submitted 2 October, 2022; v1 submitted 22 August, 2022; originally announced August 2022.

  17. Scrooge: A Fast and Memory-Frugal Genomic Sequence Aligner for CPUs, GPUs, and ASICs

    Authors: Joël Lindegger, Damla Senol Cali, Mohammed Alser, Juan Gómez-Luna, Nika Mansouri Ghiasi, Onur Mutlu

    Abstract: Pairwise sequence alignment is a very time-consuming step in common bioinformatics pipelines. Speeding up this step requires heuristics, efficient implementations, and/or hardware acceleration. A promising candidate for all of the above is the recently proposed GenASM algorithm. We identify and address three inefficiencies in the GenASM algorithm: it has a high amount of data movement, a large mem…

    Submitted 12 April, 2023; v1 submitted 21 August, 2022; originally announced August 2022.

  18. arXiv:2208.01243  [pdf, other]

    cs.AR cs.DC

    A Framework for High-throughput Sequence Alignment using Real Processing-in-Memory Systems

    Authors: Safaa Diab, Amir Nassereldine, Mohammed Alser, Juan Gómez-Luna, Onur Mutlu, Izzat El Hajj

    Abstract: Sequence alignment is a memory bound computation whose performance in modern systems is limited by the memory bandwidth bottleneck. Processing-in-memory architectures alleviate this bottleneck by providing the memory with computing competencies. We propose Alignment-in-Memory (AIM), a framework for high-throughput sequence alignment using processing-in-memory, and evaluate it on UPMEM, the first p…

    Submitted 27 March, 2023; v1 submitted 2 August, 2022; originally announced August 2022.

  19. arXiv:2207.07886  [pdf, other]

    cs.AR cs.AI cs.DC cs.LG

    An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System

    Authors: Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu

    Abstract: Training machine learning (ML) algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e.,…

    Submitted 5 September, 2023; v1 submitted 16 July, 2022; originally announced July 2022.

    Comments: Our open-source software is available at https://github.com/CMU-SAFARI/pim-ml

  20. arXiv:2206.06022  [pdf, other]

    cs.AR cs.LG

    Machine Learning Training on a Real Processing-in-Memory System

    Authors: Juan Gómez-Luna, Yuxin Guo, Sylvan Brocard, Julien Legriel, Remy Cimadomo, Geraldo F. Oliveira, Gagandeep Singh, Onur Mutlu

    Abstract: Training machine learning algorithms is a computationally intensive process, which is frequently memory-bound due to repeatedly accessing large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., comp…

    Submitted 3 August, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

    Comments: This extended abstract appears as an invited paper at the 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

  21. arXiv:2206.00938  [pdf, other]

    cs.AR

    Exploiting Near-Data Processing to Accelerate Time Series Analysis

    Authors: Ivan Fernandez, Ricardo Quislant, Christina Giannoula, Mohammed Alser, Juan Gómez-Luna, Eladio Gutiérrez, Oscar Plata, Onur Mutlu

    Abstract: Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, and more. Matrix profile, the state-of-the-art algorithm to perform time series analysis, computes the most similar subsequence for a given query subsequence within a sliced time series. Matrix profile has low arithmetic inte…

    Submitted 2 June, 2022; originally announced June 2022.

    Comments: To appear in ISVLSI 2022 Special Session on Processing in Memory. arXiv admin note: text overlap with arXiv:2010.02079

  22. arXiv:2205.14664  [pdf, other]

    cs.AR cs.AI cs.DB cs.DC cs.LG

    Heterogeneous Data-Centric Architectures for Modern Data-Intensive Applications: Case Studies in Machine Learning and Databases

    Authors: Geraldo F. Oliveira, Amirali Boroumand, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

    Abstract: Today's computing systems require moving data back-and-forth between computing resources (e.g., CPUs, GPUs, accelerators) and off-chip main memory so that computation can take place on the data. Unfortunately, this data movement is a major bottleneck for system performance and energy consumption. One promising execution paradigm that alleviates the data movement bottleneck in modern and emerging a…

    Submitted 29 May, 2022; originally announced May 2022.

  23. arXiv:2205.14647  [pdf, other]

    cs.AR cs.DC cs.PF

    Methodologies, Workloads, and Tools for Processing-in-Memory: Enabling the Adoption of Data-Centric Architectures

    Authors: Geraldo F. Oliveira, Juan Gómez-Luna, Saugata Ghose, Onur Mutlu

    Abstract: The increasing prevalence and growing size of data in modern applications have led to high costs for computation in traditional processor-centric computing systems. Moving large volumes of data between memory devices (e.g., DRAM) and computing elements (e.g., CPUs, GPUs) across bandwidth-limited memory channels can consume more than 60% of the total energy in modern systems. To mitigate these cost…

    Submitted 31 May, 2022; v1 submitted 29 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2012.11890

  24. arXiv:2205.07957  [pdf]

    q-bio.GN cs.AR q-bio.QM

    Going From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis

    Authors: Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu

    Abstract: We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologie…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: arXiv admin note: text overlap with arXiv:2008.00961

  25. arXiv:2205.07394  [pdf, other]

    cs.AR cs.AI cs.DC cs.LG

    Sibyl: Adaptive and Extensible Data Placement in Hybrid Storage Systems Using Online Reinforcement Learning

    Authors: Gagandeep Singh, Rakesh Nadig, Jisung Park, Rahul Bera, Nastaran Hajinazar, David Novo, Juan Gómez-Luna, Sander Stuijk, Henk Corporaal, Onur Mutlu

    Abstract: Hybrid storage systems (HSS) use multiple different storage devices to provide high and scalable storage capacity at high performance. Recent research proposes various techniques that aim to accurately identify performance-critical data to place it in a "best-fit" storage device. Unfortunately, most of these techniques are rigid, which (1) limits their adaptivity to perform well for a wide range o…

    Submitted 16 November, 2023; v1 submitted 15 May, 2022; originally announced May 2022.

  26. SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

    Authors: Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Zülal Bingöl, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika Mansouri Ghiasi, Gagandeep Singh, Juan Gómez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, Onur Mutlu

    Abstract: A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in…

    Submitted 31 May, 2022; v1 submitted 12 May, 2022; originally announced May 2022.

    Comments: To appear in ISCA'22

  27. arXiv:2204.00900  [pdf, ps, other]

    cs.AR cs.DC cs.PF

    Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

    Authors: Christina Giannoula, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu

    Abstract: Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low me…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2201.05072

  28. arXiv:2203.15561  [pdf, ps, other]

    cs.AR

    Algorithmic Improvement and GPU Acceleration of the GenASM Algorithm

    Authors: Joël Lindegger, Damla Senol Cali, Mohammed Alser, Juan Gómez-Luna, Onur Mutlu

    Abstract: We improve on GenASM, a recent algorithm for genomic sequence alignment, by significantly reducing its memory footprint and bandwidth requirement. Our algorithmic improvements reduce the memory footprint by 24$\times$ and the number of memory accesses by 12$\times$. We efficiently parallelize the algorithm for GPUs, achieving a 4.1$\times$ speedup over a CPU implementation of the same algorithm, a…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: To appear at the 21st IEEE International Workshop on High Performance Computational Biology (HiCOMB) 2022

  29. arXiv:2202.02310  [pdf, other]

    cs.LG cs.AR

    EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators

    Authors: Lois Orosa, Skanda Koppula, Yaman Umuroglu, Konstantinos Kanellopoulos, Juan Gomez-Luna, Michaela Blott, Kees Vissers, Onur Mutlu

    Abstract: Dilated and transposed convolutions are widely used in modern convolutional neural networks (CNNs). These kernels are used extensively during CNN training and inference of applications such as image segmentation and high-resolution image generation. Although these kernels have grown in popularity, they stress current compute systems due to their high memory intensity, exascale compute demands, and…

    Submitted 4 February, 2022; originally announced February 2022.

  30. arXiv:2201.05072  [pdf, other]

    cs.AR cs.DC cs.PF

    SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

    Authors: Christina Giannoula, Ivan Fernandez, Juan Gómez-Luna, Nectarios Koziris, Georgios Goumas, Onur Mutlu

    Abstract: Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low me…

    Submitted 23 May, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: To appear in the Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) 2022 and the ACM SIGMETRICS 2022 conference

  31. arXiv:2112.14216  [pdf, other]

    cs.AR

    Casper: Accelerating Stencil Computation using Near-cache Processing

    Authors: Alain Denzler, Rahul Bera, Nastaran Hajinazar, Gagandeep Singh, Geraldo F. Oliveira, Juan Gómez-Luna, Onur Mutlu

    Abstract: Stencil computation is one of the most used kernels in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations. Stencil computations are characterized by three unique properties: (1) low arithmetic intensity, (2) limited temporal data reuse, and (3) regular and predictable data access pattern. As a result, stencil computation…

    Submitted 5 September, 2023; v1 submitted 28 December, 2021; originally announced December 2021.

    ACM Class: C.3

  32. arXiv:2111.02325  [pdf, other]

    cs.AR cs.PF

    Extending Memory Capacity in Consumer Devices with Emerging Non-Volatile Memory: An Experimental Study

    Authors: Geraldo F. Oliveira, Saugata Ghose, Juan Gómez-Luna, Amirali Boroumand, Alexis Savery, Sonny Rao, Salman Qazi, Gwendal Grignou, Rahul Thakur, Eric Shiu, Onur Mutlu

    Abstract: The number and diversity of consumer devices are growing rapidly, alongside their target applications' memory consumption. Unfortunately, DRAM scalability is becoming a limiting factor to the available memory capacity in consumer devices. As a potential solution, manufacturers have introduced emerging non-volatile memories (NVMs) into the market, which can be used to increase the memory capacity o…

    Submitted 19 September, 2023; v1 submitted 3 November, 2021; originally announced November 2021.

    Comments: This paper has been accepted by IEEE Access

  33. arXiv:2110.01709  [pdf, other]

    cs.AR cs.DC cs.PF

    Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

    Authors: Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

    Abstract: Many modern workloads such as neural network inference and graph processing are fundamentally memory-bound. For such workloads, data movement between memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads…

    Submitted 3 April, 2023; v1 submitted 4 October, 2021; originally announced October 2021.

    Comments: Invited paper to appear at Workshop on Computing with Unconventional Technologies (CUT) 2021 https://sites.google.com/umn.edu/cut-2021/home. arXiv admin note: substantial text overlap with arXiv:2105.03814

  34. arXiv:2107.08716  [pdf, other]

    cs.AR cs.DC

    Accelerating Weather Prediction using Near-Memory Reconfigurable Fabric

    Authors: Gagandeep Singh, Dionysios Diamantopoulos, Juan Gómez-Luna, Christoph Hagleitner, Sander Stuijk, Henk Corporaal, Onur Mutlu

    Abstract: Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to ac…

    Submitted 21 December, 2021; v1 submitted 19 July, 2021; originally announced July 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2009.08241, arXiv:2106.06433

  35. FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications

    Authors: Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, Onur Mutlu

    Abstract: Modern data-intensive applications demand high computation capabilities with strict power constraints. Unfortunately, such applications suffer from a significant waste of both execution cycles and energy in current computing systems due to the costly data movement between the computation units and the memory units. Genome analysis and weather prediction are two examples of such applications. Recen…

    Submitted 3 July, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: This is an extended and updated version of a paper published in IEEE Micro, vol. 41, no. 4, pp. 39-48, 1 July-Aug. 2021

  36. arXiv:2106.05632  [pdf, other]

    cs.AR cs.CR

    CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations

    Authors: Lois Orosa, Yaohua Wang, Mohammad Sadrosadati, Jeremie S. Kim, Minesh Patel, Ivan Puddu, Haocong Luo, Kaveh Razavi, Juan Gómez-Luna, Hasan Hassan, Nika Mansouri-Ghiasi, Saugata Ghose, Onur Mutlu

    Abstract: DRAM is the dominant main memory technology used in modern computing systems. Computing systems implement a memory controller that interfaces with DRAM via DRAM commands. DRAM executes the given commands using internal components (e.g., access transistors, sense amplifiers) that are orchestrated by DRAM internal timings, which are fixed for each DRAM command. Unfortunately, the use of fixed interna…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: Extended version of an ISCA 2021 paper

    ACM Class: B.3; K.6.5

  37. arXiv:2105.03814  [pdf, other]

    cs.AR cs.DC cs.PF

    Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

    Authors: Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

    Abstract: Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bo…

    Submitted 4 May, 2022; v1 submitted 8 May, 2021; originally announced May 2021.

    Comments: Our open source software is available at https://github.com/CMU-SAFARI/prim-benchmarks

  38. arXiv:2105.03725  [pdf, other]

    cs.AR cs.DC cs.PF

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Authors: Geraldo F. Oliveira, Juan Gómez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu

    Abstract: Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data…

    Submitted 6 April, 2023; v1 submitted 8 May, 2021; originally announced May 2021.

    Comments: Our open source software is available at https://github.com/CMU-SAFARI/DAMOV

  39. pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

    Authors: João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S. Kim, Geraldo F. Oliveira, Taha Shahroodi, Anant Nori, Onur Mutlu

    Abstract: Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory…

    Submitted 3 October, 2022; v1 submitted 15 April, 2021; originally announced April 2021.

    ACM Class: B.3.1; C.1.3

    Journal ref: IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022, 900-919

  40. arXiv:2101.07557  [pdf, other]

    cs.AR cs.DC

    SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures

    Authors: Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, Onur Mutlu

    Abstract: Near-Data-Processing (NDP) architectures present a promising way to alleviate data movement costs and can provide significant performance and energy benefits to parallel applications. Typically, NDP architectures support several NDP units, each including multiple simple cores placed close to memory. To fully leverage the benefits of NDP and achieve high performance for parallel workloads, efficien…

    Submitted 13 February, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

    Comments: To appear in the 27th IEEE International Symposium on High-Performance Computer Architecture (HPCA-27)

  41. arXiv:2012.11890  [pdf, ps, other]

    cs.AR cs.DC cs.ET

    SIMDRAM: A Framework for Bit-Serial SIMD Processing Using DRAM

    Authors: Nastaran Hajinazar, Geraldo F. Oliveira, Sven Gregorio, João Dinis Ferreira, Nika Mansouri Ghiasi, Minesh Patel, Mohammed Alser, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu

    Abstract: Processing-using-DRAM has been proposed for a limited set of basic operations (i.e., logic operations, addition). However, in order to enable the full adoption of processing-using-DRAM, it is necessary to provide support for more complex operations. In this paper, we propose SIMDRAM, a flexible general-purpose processing-using-DRAM framework that enables massively-parallel computation of a wide ra…

    Submitted 22 December, 2020; originally announced December 2020.

    Comments: Extended abstract of the full paper to appear in ASPLOS 2021

  42. arXiv:2012.03112  [pdf, other]

    cs.AR cs.DC

    A Modern Primer on Processing in Memory

    Authors: Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun

    Abstract: Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a…

    Submitted 31 August, 2022; v1 submitted 5 December, 2020; originally announced December 2020.

    Comments: arXiv admin note: substantial text overlap with arXiv:1903.03988

  43. arXiv:2010.02079  [pdf, other]

    cs.AR

    NATSA: A Near-Data Processing Accelerator for Time Series Analysis

    Authors: Ivan Fernandez, Ricardo Quislant, Christina Giannoula, Mohammed Alser, Juan Gómez-Luna, Eladio Gutiérrez, Oscar Plata, Onur Mutlu

    Abstract: Time series analysis is a key technique for extracting and predicting events in domains as diverse as epidemiology, genomics, neuroscience, environmental sciences, economics, and more. Matrix profile, the state-of-the-art algorithm to perform time series analysis, computes the most similar subsequence for a given query subsequence within a sliced time series. Matrix profile has low arithmetic inte…

    Submitted 6 October, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

    Comments: To appear in the 38th IEEE International Conference on Computer Design (ICCD 2020)

  44. Accelerating Sparse Matrix-Matrix Multiplication with GPU Tensor Cores

    Authors: Orestis Zachariadis, Nitin Satpute, Juan Gómez-Luna, Joaquín Olivares

    Abstract: Sparse general matrix-matrix multiplication (spGEMM) is an essential component in many scientific and data analytics applications. However, the sparsity pattern of the input matrices and the interaction of their patterns make spGEMM challenging. Modern GPUs include Tensor Core Units (TCUs), which specialize in dense matrix multiplication. Our aim is to re-purpose TCUs for sparse matrices. The key…

    Submitted 29 September, 2020; originally announced September 2020.

    Comments: Accepted in CAEE

    Journal ref: Comput. Electr. Eng. 88 (2020) 106848

  45. arXiv:2009.08241  [pdf, other]

    cs.AR

    NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling

    Authors: Gagandeep Singh, Dionysios Diamantopoulos, Christoph Hagleitner, Juan Gomez-Luna, Sander Stuijk, Onur Mutlu, Henk Corporaal

    Abstract: Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to ac…

    Submitted 17 September, 2020; originally announced September 2020.

    Comments: This paper appears in FPL 2020

  46. arXiv:2009.07692  [pdf, other]

    cs.AR q-bio.GN

    GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

    Authors: Damla Senol Cali, Gurpreet S. Kalsi, Zülal Bingöl, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, Onur Mutlu

    Abstract: Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major co…

    Submitted 16 September, 2020; originally announced September 2020.

    Comments: To appear in MICRO 2020

  47. Accelerating B-spline Interpolation on GPUs: Application to Medical Image Registration

    Authors: Orestis Zachariadis, Andrea Teatini, Nitin Satpute, Juan Gómez-Luna, Onur Mutlu, Ole Jakob Elle, Joaquín Olivares

    Abstract: Background and Objective. B-spline interpolation (BSI) is a popular technique in the context of medical imaging due to its adaptability and robustness in 3D object modeling. A field that utilizes BSI is Image Guided Surgery (IGS). IGS provides navigation using medical images, which can be segmented and reconstructed into 3D models, often through BSI. Image registration tasks also use BSI to align…

    Submitted 17 April, 2020; v1 submitted 13 April, 2020; originally announced April 2020.

    Comments: Accepted in CMPB

    Journal ref: Comput. Methods Programs Biomed. 193 (2020) 105431

  48. arXiv:1910.09020  [pdf, other]

    q-bio.GN cs.AR cs.DC cs.DS

    SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs

    Authors: Mohammed Alser, Taha Shahroodi, Juan Gomez-Luna, Can Alkan, Onur Mutlu

    Abstract: Motivation: We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that conn…

    Submitted 23 November, 2020; v1 submitted 20 October, 2019; originally announced October 2019.

    Comments: To appear in Bioinformatics

    Journal ref: Bioinformatics, Apr 1;36(22-23):5282-5290, 2021

  49. arXiv:1907.12947  [pdf]

    cs.DC cs.AR

    A Workload and Programming Ease Driven Perspective of Processing-in-Memory

    Authors: Saugata Ghose, Amirali Boroumand, Jeremie S. Kim, Juan Gómez-Luna, Onur Mutlu

    Abstract: Many modern and emerging applications must process increasingly large volumes of data. Unfortunately, prevalent computing paradigms are not designed to efficiently handle such large-scale data: the energy and performance costs to move this data between the memory subsystem and the CPU now dominate the total costs of computation. This forces system architects and designers to fundamentally rethink…

    Submitted 26 July, 2019; originally announced July 2019.

  50. arXiv:1905.04376  [pdf, ps, other]

    cs.DC

    Enabling Practical Processing in and near Memory for Data-Intensive Computing

    Authors: Onur Mutlu, Saugata Ghose, Juan Gómez-Luna, Rachata Ausavarungnirun

    Abstract: Modern computing systems suffer from the dichotomy between computation on one side, which is performed only in the processor (and accelerators), and data storage/movement on the other, which all other parts of the system are dedicated to. Due to this dichotomy, data moves a lot in order for the system to perform computation on it. Unfortunately, data movement is extremely expensive in terms of ene…

    Submitted 2 May, 2019; originally announced May 2019.

    Comments: A version of this work is to appear in a DAC 2019 Special Session as an Invited Paper in June 2019. arXiv admin note: substantial text overlap with arXiv:1903.03988