-
Learned 3D volumetric recovery of clouds and its uncertainty for climate analysis
Authors:
Roi Ronen,
Ilan Koren,
Aviad Levis,
Eshkol Eytan,
Vadim Holodovsky,
Yoav Y. Schechner
Abstract:
Significant uncertainty in climate prediction and cloud physics is tied to observational gaps relating to shallow scattered clouds. Addressing these challenges requires remote sensing of their three-dimensional (3D) heterogeneous volumetric scattering content. This calls for passive scattering computed tomography (CT). We design a learning-based model (ProbCT) to achieve CT of such clouds, based on noisy multi-view spaceborne images. ProbCT infers - for the first time - the posterior probability distribution of the heterogeneous extinction coefficient, per 3D location. This yields arbitrary valuable statistics, e.g., the 3D field of the most probable extinction and its uncertainty. ProbCT uses a neural-field representation, making essentially real-time inference. ProbCT undergoes supervised training by a new labeled multi-class database of physics-based volumetric fields of clouds and their corresponding images. To improve out-of-distribution inference, we incorporate self-supervised learning through differential rendering. We demonstrate the approach in simulations and on real-world data, and indicate the relevance of 3D recovery and uncertainty to precipitation and renewable energy.
Submitted 9 March, 2024;
originally announced March 2024.
-
In Search of Truth: An Interrogation Approach to Hallucination Detection
Authors:
Yakir Yehuda,
Itzik Malkiel,
Oren Barkan,
Jonathan Weill,
Royi Ronen,
Noam Koenigstein
Abstract:
Despite the many advances of Large Language Models (LLMs) and their unprecedented rapid evolution, their impact and integration into every facet of our daily lives is limited due to various reasons. One critical factor hindering their widespread adoption is the occurrence of hallucinations, where LLMs invent answers that sound realistic, yet drift away from factual truth. In this paper, we present a novel method for detecting hallucinations in large language models, which tackles a critical issue in the adoption of these models in various real-world scenarios. Through extensive evaluations across multiple datasets and LLMs, including Llama-2, we study the hallucination levels of various recent LLMs and demonstrate the effectiveness of our method to automatically detect them. Notably, we observe up to 62% hallucinations for Llama-2 in a specific experiment, where our method achieves a Balanced Accuracy (B-ACC) of 87%, all without relying on external knowledge.
Submitted 20 March, 2024; v1 submitted 5 March, 2024;
originally announced March 2024.
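The Balanced Accuracy (B-ACC) figure quoted in this abstract can be read with the following minimal sketch, which assumes the standard binary definition (the mean of the true-positive rate and the true-negative rate). The function name and the sample labels are illustrative assumptions and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for binary labels (1 = hallucination)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return (tpr + tnr) / 2


# Hypothetical labels: 3 hallucinated answers, 7 factual ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.762
```

Unlike plain accuracy (0.8 on these labels), the balanced score weights both classes equally, which matters when hallucinations are a minority or majority of the answers.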
-
GRAM: Global Reasoning for Multi-Page VQA
Authors:
Tsachi Blau,
Sharon Fogel,
Roi Ronen,
Alona Golts,
Roy Ganz,
Elad Ben Avraham,
Aviad Aberdam,
Shahar Tsiper,
Ron Litman
Abstract:
The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computationally-heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To enforce our model to utilize the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression-transformer (C-Former), reducing the encoded sequence length, thereby allowing a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.
Submitted 18 March, 2024; v1 submitted 7 January, 2024;
originally announced January 2024.
-
CUDA-PIM: End-to-End Integration of Digital Processing-in-Memory from High-Level C++ to Microarchitectural Design
Authors:
Orian Leitersdorf,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Digital processing-in-memory (PIM) architectures mitigate the memory wall problem by facilitating parallel bitwise operations directly within memory. Recent works have demonstrated their algorithmic potential for accelerating data-intensive applications; however, there remains a significant gap in the programming model and microarchitectural design. This is further exacerbated by the emerging model of partitions, which significantly complicates control and periphery. Therefore, inspired by NVIDIA CUDA, this paper provides an end-to-end architectural integration of digital memristive PIM from an abstract high-level C++ programming interface for vector operations to the low-level microarchitecture.
We begin by proposing an efficient microarchitecture and instruction set architecture (ISA) that bridge the gap between the low-level control periphery and an abstraction of PIM parallelism into warps and threads. We subsequently propose a PIM compilation library that converts high-level C++ to ISA instructions, and a PIM driver that translates ISA instructions into PIM micro-operations. This drastically simplifies the development of PIM applications and enables PIM integration within larger existing C++ CPU/GPU programs for heterogeneous computing with significant ease.
Lastly, we present an efficient GPU-accelerated simulator for the proposed PIM microarchitecture. Although slower than a theoretical PIM chip, this simulator provides an accessible platform for developers to start executing and debugging PIM algorithms. To validate our approach, we implement state-of-the-art matrix operations and FFT PIM-based algorithms as case studies. These examples demonstrate drastically simplified development without compromising performance, showing the potential and significance of CUDA-PIM.
Submitted 27 August, 2023;
originally announced August 2023.
-
Accelerating Relational Database Analytical Processing with Bulk-Bitwise Processing-in-Memory
Authors:
Ben Perach,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Online Analytical Processing (OLAP) for relational databases is a business decision support application. The application receives queries about the business database, usually requesting to summarize many database records, and produces few results. Existing OLAP requires transferring a large amount of data between the memory and the CPU, having a few operations per datum, and producing a small output. Hence, OLAP is a good candidate for processing-in-memory (PIM), where computation is performed where the data is stored, thus accelerating applications by reducing data movement between the memory and CPU. In particular, bulk-bitwise PIM, where the memory array is a bit-vector processing unit, seems a good match for OLAP. With the extensive inherent parallelism and minimal data movement of bulk-bitwise PIM, OLAP applications can process the entire database in parallel in memory, transferring only the results to the CPU. This paper shows a full stack adaptation of a bulk-bitwise PIM, from compiling SQL to hardware implementation, for supporting OLAP applications. Evaluating the Star Schema Benchmark (SSB), bulk-bitwise PIM achieves a 4.65X speedup over MonetDB, a standard database system.
Submitted 2 July, 2023;
originally announced July 2023.
-
GPT-Calls: Enhancing Call Segmentation and Tagging by Generating Synthetic Conversations via Large Language Models
Authors:
Itzik Malkiel,
Uri Alon,
Yakir Yehuda,
Shahar Keren,
Oren Barkan,
Royi Ronen,
Noam Koenigstein
Abstract:
Transcriptions of phone calls are of significant value across diverse fields, such as sales, customer service, healthcare, and law enforcement. Nevertheless, the analysis of these recorded conversations can be an arduous and time-intensive process, especially when dealing with extended or multifaceted dialogues. In this work, we propose a novel method, GPT-distilled Calls Segmentation and Tagging (GPT-Calls), for efficient and accurate call segmentation and topic extraction. GPT-Calls is composed of offline and online phases. The offline phase is applied once to a given list of topics and involves generating a distribution of synthetic sentences for each topic using a GPT model and extracting anchor vectors. The online phase is applied to every call separately and scores the similarity between the transcribed conversation and the topic anchors found in the offline phase. Then, time domain analysis is applied to the similarity scores to group utterances into segments and tag them with topics. The proposed paradigm provides an accurate and efficient method for call segmentation and topic extraction that does not require labeled data, thus making it a versatile approach applicable to various domains. Our algorithm operates in production under Dynamics 365 Sales Conversation Intelligence, and our research is based on real sales conversations gathered from various Dynamics 365 Sales tenants.
Submitted 9 June, 2023;
originally announced June 2023.
-
ConvPIM: Evaluating Digital Processing-in-Memory through Convolutional Neural Network Acceleration
Authors:
Orian Leitersdorf,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Processing-in-memory (PIM) architectures are emerging to reduce data movement in data-intensive applications. These architectures seek to exploit the same physical devices for both information storage and logic, thereby dwarfing the required data transfer and utilizing the full internal memory bandwidth. Whereas analog PIM utilizes the inherent connectivity of crossbar arrays for approximate matrix-vector multiplication in the analog domain, digital PIM architectures enable bitwise logic operations with massive parallelism across columns of data within memory arrays. Several recent works have extended the computational capabilities of digital PIM architectures towards the full-precision (single-precision floating-point) acceleration of convolutional neural networks (CNNs); yet, they lack a comprehensive comparison to GPUs. In this paper, we examine the potential of digital PIM for CNN acceleration through an updated quantitative comparison with GPUs, supplemented with an analysis of the overall limitations of digital PIM. We begin by investigating the different PIM architectures from a theoretical perspective to understand the underlying performance limitations and improvements compared to state-of-the-art hardware. We then uncover the tradeoffs between the different strategies through a series of benchmarks ranging from memory-bound vectored arithmetic to CNN acceleration. We conclude with insights into the general performance of digital PIM architectures for different data-intensive applications.
Submitted 6 May, 2023;
originally announced May 2023.
-
FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication
Authors:
Orian Leitersdorf,
Yahav Boneh,
Gonen Gazit,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
The Discrete Fourier Transform (DFT) is essential for various applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time complexity from the naive O(n^2) to O(n log n), and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities as memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures that shift computation into the memory by exploiting physical devices capable of storage and logic (e.g., memristors). We propose an O(log n) in-memory FFT algorithm that can also be performed in parallel across multiple arrays for high-throughput batched execution, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to O(log n) polynomial multiplication - a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly-available cycle-accurate simulator that verifies both correctness and performance, and demonstrate 5-15x throughput and 4-13x energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.
Submitted 5 April, 2023;
originally announced April 2023.
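The convolution-theorem step mentioned in this abstract (polynomial multiplication as a pointwise product in the frequency domain) can be illustrated with a conventional CPU sketch. This is not the paper's in-memory algorithm: it uses NumPy's FFT routines, assumes integer coefficients listed from lowest to highest degree, and the function name `poly_multiply_fft` is hypothetical.

```python
import numpy as np


def poly_multiply_fft(a, b):
    """Multiply two integer-coefficient polynomials via FFT (coefficients low degree first)."""
    n_out = len(a) + len(b) - 1          # number of coefficients in the product
    size = 1
    while size < n_out:                  # pad to a power of two for the FFT
        size *= 2
    fa = np.fft.rfft(a, n=size)          # forward transforms (inputs are zero-padded to `size`)
    fb = np.fft.rfft(b, n=size)
    product = np.fft.irfft(fa * fb, n=size)[:n_out]  # pointwise product, then inverse transform
    return np.rint(product).astype(int)  # round away floating-point error


# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(poly_multiply_fft([1, 2, 3], [4, 5]).tolist())  # [4, 13, 22, 15]
```

Zero-padding to at least `len(a) + len(b) - 1` points is what turns the FFT's circular convolution into the linear convolution that polynomial multiplication requires.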
-
Enabling Relational Database Analytical Processing in Bulk-Bitwise Processing-In-Memory
Authors:
Ben Perach,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Bulk-bitwise processing-in-memory (PIM), an emerging computational paradigm utilizing memory arrays as computational units, has been shown to benefit database applications. This paper demonstrates how GROUP-BY and JOIN, database operations not supported by previous works, can be performed efficiently in bulk-bitwise PIM for relational database analytical processing. We extend the gem5 simulator and evaluate our hardware modifications on the Star Schema Benchmark. We show that compared to previous works, our modifications improve (on average) execution time by 1.83X, energy by 4.31X, and the system's lifetime by 3.21X. We also achieve a speedup of 4.65X over MonetDB, a modern state-of-the-art in-memory database.
Submitted 2 November, 2023; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Complex-valued Retrievals From Noisy Images Using Diffusion Models
Authors:
Nadav Torem,
Roi Ronen,
Yoav Y. Schechner,
Michael Elad
Abstract:
In diverse microscopy modalities, sensors measure only real-valued intensities. Additionally, the sensor readouts are affected by Poissonian-distributed photon noise. Traditional restoration algorithms typically aim to minimize the mean squared error (MSE) between the original and recovered images. This often leads to blurry outcomes with poor perceptual quality. Recently, deep diffusion models (DDMs) have proven to be highly capable of sampling images from the a-posteriori probability of the sought variables, resulting in visually pleasing high-quality images. These models have mostly been suggested for real-valued images suffering from Gaussian noise. In this study, we generalize annealed Langevin Dynamics, a type of DDM, to tackle the fundamental challenges in optical imaging of complex-valued objects (and real images) affected by Poisson noise. We apply our algorithm to various optical scenarios, such as Fourier Ptychography, Phase Retrieval, and Poisson denoising. Our algorithm is evaluated on simulations and biological empirical data.
Submitted 28 July, 2023; v1 submitted 6 December, 2022;
originally announced December 2022.
-
abstractPIM: A Technology Backward-Compatible Compilation Flow for Processing-In-Memory
Authors:
Adi Eliahu,
Rotem Ben-Hur,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
The von Neumann architecture, in which the memory and the computation units are separated, demands massive data traffic between the memory and the CPU. To reduce data movement, new technologies and computer architectures have been explored. The use of memristors, which are devices with both memory and computation capabilities, has been considered for different processing-in-memory (PIM) solutions, including using memristive stateful logic for a programmable digital PIM system. Nevertheless, all previous work has focused on a specific stateful logic family, and on optimizing the execution for a certain target machine. These solutions require a new compiler and recompilation when changing the target machine, and provide no backward compatibility with other target machines. In this chapter, we present abstractPIM, a new compilation concept and flow which enables executing any function within the memory, using different stateful logic families and different instruction set architectures (ISAs). By separating the code generation into two independent components, an intermediate representation of the code using a target-independent ISA and then microcode generation for a specific target machine, we provide a flexible flow with backward compatibility and lay foundations for a PIM compiler. Using abstractPIM, we explore various logic technologies and ISAs and how they impact each other, and discuss the associated challenges, such as the increase in execution time.
Submitted 30 August, 2022;
originally announced August 2022.
-
GLASS: Global to Local Attention for Scene-Text Spotting
Authors:
Roi Ronen,
Shahar Tsiper,
Oron Anschel,
Inbal Lavi,
Amir Markovitz,
R. Manmatha
Abstract:
In recent years, the dominant paradigm for text spotting is to combine the tasks of text detection and recognition into a single end-to-end framework. Under this paradigm, both tasks are accomplished by operating over a shared global feature map extracted from the input image. Among the main challenges that end-to-end approaches face is the performance degradation when recognizing text across scale variations (smaller or larger text), and arbitrary word rotation angles. In this work, we address these challenges by proposing a novel global-to-local attention mechanism for text spotting, termed GLASS, that fuses together global and local features. The global features are extracted from the shared backbone, preserving contextual information from the entire image, while the local features are computed individually on resized, high-resolution rotated word crops. The information extracted from the local crops alleviates much of the inherent difficulties with scale and word rotation. We show a performance analysis across scales and angles, highlighting improvement over scale and angle extremities. In addition, we introduce an orientation-aware loss term supervising the detection task, and show its contribution to both detection and recognition performance across all angles. Finally, we show that GLASS is general by incorporating it into other leading text spotting architectures, improving their text spotting performance. Our method achieves state-of-the-art results on multiple benchmarks, including the newly released TextOCR.
Submitted 5 August, 2022;
originally announced August 2022.
-
MatPIM: Accelerating Matrix Operations with Memristive Stateful Logic
Authors:
Orian Leitersdorf,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
The emerging memristive Memory Processing Unit (mMPU) overcomes the memory wall through memristive devices that unite storage and logic for real processing-in-memory (PIM) systems. At the core of the mMPU is stateful logic, which is accelerated with memristive partitions to enable logic with massive inherent parallelism within crossbar arrays. This paper vastly accelerates the fundamental operations of matrix-vector multiplication and convolution in the mMPU, with either full-precision or binary elements. These proposed algorithms establish an efficient foundation for large-scale mMPU applications such as neural networks, image processing, and numerical methods. We overcome the inherent asymmetry limitation in the previous in-memory full-precision matrix-vector multiplication solutions by utilizing techniques from block matrix multiplication and reduction. We present the first fast in-memory binary matrix-vector multiplication algorithm by utilizing memristive partitions with a tree-based popcount reduction (39x faster than previous work). For convolution, we present a novel in-memory input-parallel concept which we utilize for a full-precision algorithm that overcomes the asymmetry limitation in convolution, while also improving latency (2x faster than previous work), and the first fast binary algorithm (12x faster than previous work).
Submitted 30 June, 2022;
originally announced June 2022.
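The idea of a tree-based popcount reduction for binary matrix-vector multiplication, as named in this abstract, can be sketched in plain software. The sketch below is only an illustration under stated assumptions: elements are taken to be in {0, 1} and combined with a bitwise AND (the paper's encoding may differ), the code runs sequentially rather than in memristive partitions, and the names `tree_popcount` and `binary_matvec` are hypothetical.

```python
def tree_popcount(bits):
    """Count the ones in `bits` by pairwise (tree) reduction: about log2(n) levels."""
    level = list(bits)
    while len(level) > 1:
        if len(level) % 2:               # pad odd-length levels with a neutral zero
            level.append(0)
        # Each pair could be summed independently, so a level is one parallel step.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def binary_matvec(matrix, vector):
    """Binary matrix-vector product: per row, AND with the vector, then popcount."""
    return [tree_popcount([m & v for m, v in zip(row, vector)]) for row in matrix]


matrix = [[1, 0, 1, 1],
          [0, 1, 1, 0],
          [1, 1, 1, 1]]
vector = [1, 1, 0, 1]
print(binary_matvec(matrix, vector))  # [2, 1, 3]
```

The tree shape is the point of the example: a linear accumulation needs n - 1 dependent additions, whereas the pairwise scheme needs only a logarithmic number of levels when the additions within a level can proceed in parallel.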
-
AritPIM: High-Throughput In-Memory Arithmetic
Authors:
Orian Leitersdorf,
Dean Leitersdorf,
Jonathan Gal,
Mor Dahan,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Digital processing-in-memory (PIM) architectures are rapidly emerging to overcome the memory-wall bottleneck by integrating logic within memory elements. Such architectures provide vast computational power within the memory itself in the form of parallel bitwise logic operations. We develop novel algorithmic techniques for PIM that, combined with new perspectives on computer arithmetic, extend this bitwise parallelism to the four fundamental arithmetic operations (addition, subtraction, multiplication, and division), for both fixed-point and floating-point numbers, and using both bit-serial and bit-parallel approaches. We propose a state-of-the-art suite of arithmetic algorithms, demonstrating the first algorithm in the literature of digital PIM for a majority of cases - including cases previously considered impossible for digital PIM, such as floating-point addition. Through a case study on memristive PIM, we compare the proposed algorithms to an NVIDIA RTX 3070 GPU and demonstrate significant throughput and energy improvements.
Submitted 15 April, 2023; v1 submitted 8 June, 2022;
originally announced June 2022.
-
PartitionPIM: Practical Memristive Partitions for Fast Processing-in-Memory
Authors:
Orian Leitersdorf,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Digital memristive processing-in-memory overcomes the memory wall through a fundamental storage device capable of stateful logic within crossbar arrays. Dynamically dividing the crossbar arrays by adding memristive partitions further increases parallelism, thereby overcoming an inherent trade-off in memristive processing-in-memory. The algorithmic topology of partitions is highly unique, and was recently exploited to accelerate multiplication (11x with 32 partitions) and sorting (14x with 16 partitions). Yet, the physical implementation of memristive partitions, such as the peripheral decoders and the control message, has never been considered and may lead to vast impracticality. This paper overcomes that challenge with several novel techniques, presenting efficient practical designs of memristive partitions. We begin by formalizing the algorithmic properties of memristive partitions into serial, parallel, and semi-parallel operations. Peripheral overhead is addressed via a novel technique of half-gates that enables efficient decoding with negligible overhead. Control overhead is addressed by carefully reducing the operation set of memristive partitions, while resulting in negligible performance impact, by utilizing techniques such as shared indices and pattern generators. Ultimately, these efficient practical solutions, combined with the vast algorithmic potential, may revolutionize digital memristive processing-in-memory.
Submitted 8 June, 2022;
originally announced June 2022.
-
FiltPIM: In-Memory Filter for DNA Sequencing
Authors:
Marcel Khalifa,
Rotem Ben-Hur,
Ronny Ronen,
Orian Leitersdorf,
Leonid Yavits,
Shahar Kvatinsky
Abstract:
Aligning the entire genome of an organism is a compute-intensive task. Pre-alignment filters substantially reduce computation complexity by filtering potential alignment locations. The base-count filter successfully removes over 68% of the potential locations through a histogram-based heuristic. This paper presents FiltPIM, an efficient design of the base-count filter that is based on memristive processing-in-memory. The in-memory design reduces CPU-to-memory data transfer and utilizes both intra-crossbar and inter-crossbar memristive stateful-logic parallelism. The reduction in data transfer and the efficient stateful-logic computation together improve filtering time by 100x compared to a CPU implementation of the filter.
Submitted 2 June, 2022; v1 submitted 30 May, 2022;
originally announced May 2022.
-
HashPIM: High-Throughput SHA-3 via Memristive Digital Processing-in-Memory
Authors:
Batel Oved,
Orian Leitersdorf,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Recent research has sought to accelerate cryptographic hash functions as they are at the core of modern cryptography. Traditional designs, however, suffer from the von Neumann bottleneck that originates from the separation of processing and memory units. An emerging solution to overcome this bottleneck is processing-in-memory (PIM): performing logic within the same devices responsible for memory t…
▽ More
Recent research has sought to accelerate cryptographic hash functions as they are at the core of modern cryptography. Traditional designs, however, suffer from the von Neumann bottleneck that originates from the separation of processing and memory units. An emerging solution to overcome this bottleneck is processing-in-memory (PIM): performing logic within the same devices responsible for memory to eliminate data-transfer and simultaneously provide massive computational parallelism. In this paper, we seek to vastly accelerate the state-of-the-art SHA-3 cryptographic function using the memristive memory processing unit (mMPU), a general-purpose memristive PIM architecture. To that end, we propose a novel in-memory algorithm for variable rotation, and utilize an efficient map** of the SHA-3 state vector for memristive crossbar arrays to efficiently exploit PIM parallelism. We demonstrate a massive energy efficiency of 1,422 Gbps/W, improving a state-of-the-art memristive SHA-3 accelerator (SHINE-2) by 4.6x.
Submitted 1 June, 2022; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Do You Think You Can Hold Me? The Real Challenge of Problem-Space Evasion Attacks
Authors:
Harel Berger,
Amit Dvir,
Chen Hajaj,
Rony Ronen
Abstract:
Android malware is a spreading disease in the virtual world. Anti-virus and detection systems continuously undergo patches and updates to defend against these threats. Most of the latest approaches in malware detection use Machine Learning (ML). In response to the hardening of detection systems, evasion attacks have emerged, in which an adversary changes its targeted samples so that they are misclassified as benign. This paper considers two kinds of evasion attacks: feature-space and problem-space. Feature-space attacks consider an adversary who manipulates ML features to evade the correct classification while minimizing or constraining the total manipulations. Problem-space attacks refer to evasion attacks that change the actual sample. Specifically, this paper analyzes the gap between these two types in the Android malware domain. The gap between the two types of evasion attacks is examined via the retraining process of classifiers using each one of the evasion attack types. The experiments show that the gap between these two types of retrained classifiers is dramatic and may reach 96%. Classifiers retrained on feature-space evasion attacks have been found to be either less effective or completely ineffective against problem-space evasion attacks. Additionally, exploration of different problem-space evasion attacks shows that retraining on one problem-space evasion attack may be effective against other problem-space evasion attacks.
Submitted 9 May, 2022;
originally announced May 2022.
-
An End-to-End Dialogue Summarization System for Sales Calls
Authors:
Abedelkadir Asi,
Song Wang,
Roy Eisenstadt,
Dean Geckt,
Yarin Kuper,
Yi Mao,
Royi Ronen
Abstract:
Summarizing sales calls is a routine task performed manually by salespeople. We present a production system which combines generative models fine-tuned for the customer-agent setting with a human-in-the-loop user experience for an interactive summary curation process. We address challenging aspects of the dialogue summarization task in a real-world setting, including long input dialogues, content validation, lack of labeled data and quality evaluation. We show how GPT-3 can be leveraged as an offline data labeler to handle training data scarcity and accommodate privacy constraints in an industrial setting. Experiments show significant improvements by our models in tackling the summarization and content validation tasks on public datasets.
Submitted 28 April, 2022; v1 submitted 27 April, 2022;
originally announced April 2022.
-
Understanding Bulk-Bitwise Processing In-Memory Through Database Analytics
Authors:
Ben Perach,
Ronny Ronen,
Benny Kimelfeld,
Shahar Kvatinsky
Abstract:
Bulk-bitwise processing-in-memory (PIM), where large bitwise operations are performed in parallel by the memory array itself, is an emerging form of computation with the potential to mitigate the memory wall problem. This paper examines the capabilities of bulk-bitwise PIM by constructing PIMDB, a fully-digital system based on memristive stateful logic, utilizing and focusing on in-memory bulk-bitwise operations, designed to accelerate a real-life workload: analytical processing of relational databases. We introduce a host processor programming model to support bulk-bitwise PIM in virtual memory, develop techniques to efficiently perform in-memory filtering and aggregation operations, and adapt the application data set into the memory. To understand bulk-bitwise PIM, we compare it to an equivalent in-memory database on the same host system. We show that bulk-bitwise PIM substantially lowers the number of required memory read operations, thus accelerating TPC-H filter operations by 1.6x to 18x and full queries by 56x to 608x, while reducing the energy consumption by 1.7x to 18.6x and 0.81x to 12x for these benchmarks, respectively. Our extensive evaluation uses the gem5 full-system simulation environment. The simulations also evaluate cell endurance, showing that the required endurance is within the range of existing endurance of RRAM devices.
Submitted 26 September, 2023; v1 submitted 20 March, 2022;
originally announced March 2022.
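To make the idea of a bulk-bitwise filter concrete, here is a small software sketch. It assumes a bit-sliced column layout (slice j holds bit j of every row), which is one common way to evaluate an equality predicate on all rows with a handful of wide AND/NOT operations. The layout and the helper names are illustrative assumptions and do not describe PIMDB's actual design.

```python
def to_bit_slices(values, width):
    """Build bit slices: slice j holds bit j of every row (row i -> bit i)."""
    slices = []
    for j in range(width):
        s = 0
        for i, v in enumerate(values):
            s |= ((v >> j) & 1) << i
        slices.append(s)
    return slices


def filter_equal(slices, constant, num_rows):
    """Return a bitmask of rows whose value equals `constant`.

    Each loop iteration is one bulk bitwise operation over all rows.
    """
    all_rows = (1 << num_rows) - 1
    match = all_rows
    for j, s in enumerate(slices):
        if (constant >> j) & 1:
            match &= s               # rows whose bit j is 1
        else:
            match &= ~s & all_rows   # rows whose bit j is 0
    return match


def selected_rows(match, num_rows):
    """Expand the match bitmask into a list of row indices."""
    return [i for i in range(num_rows) if (match >> i) & 1]
```

The number of bulk operations depends on the value width, not on the number of rows, which is the property that makes this style of filtering attractive for in-memory execution.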
-
Efficient Training of the Memristive Deep Belief Net Immune to Non-Idealities of the Synaptic Devices
Authors:
Wei Wang,
Barak Hoffer,
Tzofnat Greenberg-Toledo,
Yang Li,
Minhui Zou,
Eric Herbelin,
Ronny Ronen,
Xiaoxin Xu,
Yulin Zhao,
Jianguo Yang,
Shahar Kvatinsky
Abstract:
The tunability of conductance states of various emerging non-volatile memristive devices emulates the plasticity of biological synapses, making them promising for the hardware realization of large-scale neuromorphic systems. The inference of the neural network can be greatly accelerated by the vector-matrix multiplication (VMM) performed within a crossbar array of memristive devices in one step. Nevertheless, the implementation of the VMM needs complex peripheral circuits, and the complexity further increases since non-idealities of memristive devices prevent precise conductance tuning (especially for online training) and largely degrade the performance of deep neural networks (DNNs). Here, we present an efficient online training method for the memristive deep belief net (DBN). The proposed memristive DBN uses stochastically binarized activations, reducing the complexity of peripheral circuits, and uses the contrastive divergence (CD) based gradient descent learning algorithm. The analog VMM and digital CD are performed separately in a mixed-signal hardware arrangement, making the memristive DBN highly immune to non-idealities of synaptic devices. The number of write operations on memristive devices is reduced by two orders of magnitude. A recognition accuracy of 95% to 97% can be achieved for the MNIST dataset using pulsed synaptic behaviors of various memristive synaptic devices.
Submitted 15 March, 2022;
originally announced March 2022.
-
Making Memristive Processing-in-Memory Reliable
Authors:
Orian Leitersdorf,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Processing-in-memory (PIM) solutions vastly accelerate systems by reducing data transfer between computation and memory. Memristors possess a unique property that enables storage and logic within the same device, which is exploited in the memristive Memory Processing Unit (mMPU). The mMPU expands fundamental stateful logic techniques, such as IMPLY, MAGIC and FELIX, to high-throughput parallel logic and arithmetic operations within the memory. Unfortunately, memristive processing-in-memory is highly vulnerable to soft errors and this massive parallelism is not compatible with traditional reliability techniques, such as error-correcting-code (ECC). In this paper, we discuss reliability techniques that efficiently support the mMPU by utilizing the same principles as the mMPU computation. We detail ECC techniques that are based on the unique properties of the mMPU to efficiently utilize the massive parallelism. Furthermore, we present novel solutions for efficiently implementing triple modular redundancy (TMR). The short-term and long-term reliability of large-scale applications, such as neural-network acceleration, are evaluated. The analysis clearly demonstrates the importance of high-throughput reliability mechanisms for memristive processing-in-memory.
Submitted 20 September, 2021;
originally announced September 2021.
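Triple modular redundancy (TMR), mentioned in the abstract, can be sketched in a few lines: the same computation is run three times and a bitwise majority vote masks any error confined to a single copy. The snippet below is a generic software illustration of the voting step only; it is not the in-memory implementation proposed in the paper, and the function name is hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results.

    Each output bit takes the value held by at least two of the inputs,
    so a bit flip in any single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)
```

The vote corrects the result as long as no bit position is corrupted in more than one copy; two copies flipped at the same position would outvote the correct one.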
-
MultPIM: Fast Stateful Multiplication for Processing-in-Memory
Authors:
Orian Leitersdorf,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Processing-in-memory (PIM) seeks to eliminate computation/memory data transfer using devices that support both storage and logic. Stateful logic techniques such as IMPLY, MAGIC and FELIX can perform logic gates within memristive crossbar arrays with massive parallelism. Multiplication via stateful logic is an active field of research due to the wide implications. Recently, RIME has become the state-of-the-art algorithm for stateful single-row multiplication by using memristive partitions, reducing the latency of the previous state-of-the-art by 5.1x. In this paper, we begin by proposing novel partition-based computation techniques for broadcasting and shifting data. Then, we design an in-memory multiplication algorithm based on the carry-save add-shift (CSAS) technique. Finally, we develop a novel stateful full-adder that significantly improves the state-of-the-art (FELIX) design. These contributions constitute MultPIM, a multiplier that reduces state-of-the-art time complexity from quadratic to linear-log. For 32-bit numbers, MultPIM improves latency by an additional 4.2x over RIME, while even slightly reducing area overhead. Furthermore, we optimize MultPIM for full-precision matrix-vector multiplication and improve latency by 25.5x over FloatPIM matrix-vector multiplication.
Submitted 20 September, 2021; v1 submitted 30 August, 2021;
originally announced August 2021.
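The carry-save idea behind add-shift multipliers can be shown with a short software sketch. It accumulates shifted partial products in a redundant (sum, carry) form, so each step needs only bitwise full-adder logic, and a single ordinary addition resolves the carries at the end. This is a simplified illustration under that assumption, not the MultPIM algorithm itself; the function names and the `width` parameter are hypothetical.

```python
def carry_save_add(a: int, b: int, c: int):
    """Reduce three addends to two without propagating carries.

    Invariant: a + b + c == partial_sum + carry.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def multiply_carry_save(x: int, y: int, width: int = 32) -> int:
    """Multiply two non-negative integers, where y fits in `width` bits."""
    s, c = 0, 0
    for i in range(width):
        if (y >> i) & 1:
            # Add the shifted partial product in carry-save form.
            s, c = carry_save_add(s, c, x << i)
    # One carry-propagating addition resolves the redundant form.
    return s + c
```

Deferring carry propagation to the final step is what keeps each iteration to a constant number of bitwise gate levels.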
-
The Bitlet Model: A Parameterized Analytical Model to Compare PIM and CPU Systems
Authors:
Ronny Ronen,
Adi Eliahu,
Orian Leitersdorf,
Natan Peled,
Kunal Korgaonkar,
Anupam Chattopadhyay,
Ben Perach,
Shahar Kvatinsky
Abstract:
Nowadays, data-intensive applications are gaining popularity and, together with this trend, processing-in-memory (PIM)-based systems are being given more attention and have become more relevant. This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to estimate the performance and the power/energy of a PIM-based system and thereby assess the affinity of workloads for PIM as opposed to traditional computing. The tool uncovers interesting tradeoffs between, mainly, the PIM computation complexity (cycles required to perform a computation through PIM), the amount of memory used for PIM, the system memory bandwidth, and the data transfer size. Despite its simplicity, the model reveals new insights when applied to real-life examples. The model is demonstrated for several synthetic examples and then applied to explore the influence of different parameters on two systems, IMAGING and FloatPIM. Based on the demonstrations, insights about PIM and its combination with CPU are drawn.
Submitted 21 July, 2021;
originally announced July 2021.
-
Efficient Error-Correcting-Code Mechanism for High-Throughput Memristive Processing-in-Memory
Authors:
Orian Leitersdorf,
Ben Perach,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Inefficient data transfer between computation and memory inspired emerging processing-in-memory (PIM) technologies. Many PIM solutions enable storage and processing using memristors in a crossbar-array structure, with techniques such as memristor-aided logic (MAGIC) used for computation. This approach provides highly-paralleled logic computation with minimal data movement. However, memristors are vulnerable to soft errors and standard error-correcting-code (ECC) techniques are difficult to implement without moving data outside the memory. We propose a novel technique for efficient ECC implementation along diagonals to support reliable computation inside the memory without explicitly reading the data. Our evaluation demonstrates an improvement of over eight orders of magnitude in reliability (mean time to failure) for an increase of about 26% in computation latency.
Submitted 10 May, 2021;
originally announced May 2021.
-
Spatiotemporal tomography based on scattered multiangular signals and its application for resolving evolving clouds using moving platforms
Authors:
Roi Ronen,
Yoav Y. Schechner,
Eshkol Eytan
Abstract:
We derive computed tomography (CT) of a time-varying volumetric translucent object, using a small number of moving cameras. We particularly focus on passive scattering tomography, which is a non-linear problem. We demonstrate the approach on dynamic clouds, as clouds have a major effect on Earth's climate. State-of-the-art scattering CT assumes a static object. Existing 4D CT methods rely on a linear image formation model and often on significant priors. In this paper, the angular and temporal sampling rates needed for a proper recovery are discussed. If these rates are used, the paper leads to a representation of the time-varying object, which simplifies 4D CT. The task is achieved using gradient-based optimization. We demonstrate this in physics-based simulations and in an experiment that yielded real-world data.
Submitted 6 December, 2020;
originally announced December 2020.
-
CONTRA: Area-Constrained Technology Mapping Framework For Memristive Memory Processing Unit
Authors:
Debjyoti Bhattacharjee,
Anupam Chattopadhyay,
Srijit Dutta,
Ronny Ronen,
Shahar Kvatinsky
Abstract:
Data-intensive applications are poised to benefit directly from processing-in-memory platforms, such as memristive Memory Processing Units, which allow leveraging data locality and performing stateful logic operations. Developing design automation flows for such platforms is a challenging and highly relevant research problem. In this work, we investigate the problem of minimizing delay under arbitrary area constraint for MAGIC-based in-memory computing platforms. We propose an end-to-end area constrained technology mapping framework, CONTRA. CONTRA uses Look-Up Table (LUT) based mapping of the input function on the crossbar array to maximize parallel operations and uses a novel search technique to move data optimally inside the array. CONTRA supports benchmarks in a variety of formats, along with crossbar dimensions as input to generate MAGIC instructions. CONTRA scales for large benchmarks, as demonstrated by our experiments. CONTRA allows mapping benchmarks to smaller crossbar dimensions than achieved by any other technique before, while allowing a wide variety of area-delay trade-offs. CONTRA improves the composite metric of area-delay product by 2.1x to 13.1x compared to seven existing technology mapping approaches.
Submitted 2 September, 2020;
originally announced September 2020.
-
The Bitlet Model: Defining a Litmus Test for the Bitwise Processing-in-Memory Paradigm
Authors:
Kunal Korgaonkar,
Ronny Ronen,
Anupam Chattopadhyay,
Shahar Kvatinsky
Abstract:
This paper describes an analytical modeling tool called Bitlet that can be used, in a parameterized fashion, to understand the affinity of workloads to processing-in-memory (PIM) as opposed to traditional computing. The tool uncovers interesting trade-offs between operation complexity (cycles required to perform an operation through PIM) and other key parameters, such as system memory bandwidth, data transfer size, the extent of data alignment, and effective memory capacity involved in PIM computations. Despite its simplicity, the model has already proven useful. In the future, we intend to extend and refine Bitlet to further increase its utility.
Submitted 22 October, 2019;
originally announced October 2019.
-
Microsoft Malware Classification Challenge
Authors:
Royi Ronen,
Marian Radu,
Corina Feuerstein,
Elad Yom-Tov,
Mansour Ahmadi
Abstract:
The Microsoft Malware Classification Challenge was announced in 2015 along with a publication of a huge dataset of nearly 0.5 terabytes, consisting of disassembly and bytecode of more than 20K malware samples. Apart from serving in the Kaggle competition, the dataset has become a standard benchmark for research on modeling malware behaviour. To date, the dataset has been cited in more than 50 research papers. Here we provide a high-level comparison of the publications citing the dataset. The comparison simplifies finding potential research directions in this field and future performance evaluation of the dataset.
Submitted 22 February, 2018;
originally announced February 2018.
-
Learning to Customize Network Security Rules
Authors:
Michael Bargury,
Roy Levin,
Royi Ronen
Abstract:
Security is a major concern for organizations who wish to leverage cloud computing. In order to reduce security vulnerabilities, public cloud providers offer firewall functionalities. When properly configured, a firewall protects cloud networks from cyber-attacks. However, proper firewall configuration requires intimate knowledge of the protected system, high expertise and on-going maintenance.
As a result, many organizations do not use firewalls effectively, leaving their cloud resources vulnerable. In this paper, we present a novel supervised learning method, and prototype, which compute recommendations for firewall rules. Recommendations are based on sampled network traffic meta-data (NetFlow) collected from a public cloud provider. Labels are extracted from firewall configurations deemed to be authored by experts. NetFlow is collected from network routers, avoiding expensive collection from cloud VMs, as well as relieving privacy concerns.
The proposed method captures network routines and dependencies between resources and firewall configuration. The method predicts IPs to be allowed by the firewall. A grouping algorithm is subsequently used to generate a manageable number of IP ranges. Each range is a parameter for a firewall rule.
We present results of experiments on real data, showing ROC AUC of 0.92, compared to 0.58 for an unsupervised baseline. The results prove the hypothesis that firewall rules can be automatically generated based on router data, and that an automated method can be effective in blocking a high percentage of malicious traffic.
Submitted 28 December, 2017;
originally announced December 2017.
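The abstract does not specify the grouping algorithm, so the following is only a stand-in that illustrates the step of turning many predicted IPs into a smaller set of ranges. It uses the Python standard library's `ipaddress.collapse_addresses`, which merges adjacent addresses into the fewest exact CIDR blocks. The function name `group_ips` and the sample addresses are hypothetical.

```python
import ipaddress


def group_ips(ips):
    """Collapse individual IPv4 addresses into the fewest exact CIDR ranges.

    Stand-in for the paper's (unspecified) grouping algorithm: it only
    merges addresses that form a complete contiguous block.
    """
    networks = [ipaddress.ip_network(ip) for ip in ips]
    return [str(net) for net in ipaddress.collapse_addresses(networks)]


allowed = ["10.0.0.0", "10.0.0.1", "10.0.0.2", "10.0.0.3", "192.168.1.7"]
ranges = group_ips(allowed)
```

Because this merge is exact, it never allows an address that was not predicted; a practical grouping step might instead tolerate some over-approximation to keep the number of rules manageable.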
-
Why & When Deep Learning Works: Looking Inside Deep Learnings
Authors:
Ronny Ronen
Abstract:
The Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) has been heavily supporting Machine Learning and Deep Learning research from its foundation in 2012. We have asked six leading ICRI-CI Deep Learning researchers to address the challenge of "Why & When Deep Learning works", with the goal of looking inside Deep Learning, providing insights on how deep networks function, and uncovering key observations on their expressiveness, limitations, and potential. This challenge resulted in five papers that address different facets of deep learning. These facets include a high-level understanding of why and when deep networks work (and do not work), the impact of geometry on the expressiveness of deep networks, and making deep networks interpretable.
Submitted 10 May, 2017;
originally announced May 2017.
-
Misassembly Detection using Paired-End Sequence Reads and Optical Mapping Data
Authors:
Martin D. Muggli,
Simon J. Puglisi,
Roy Ronen,
Christina Boucher
Abstract:
A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method that will enhance the quality of draft genomes by identifying and removing misassembly errors using paired short read sequence data and optical mapping data. We apply our method to various assemblies of the loblolly pine and Francisella tularensis genomes. Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in an assembly of Francisella tularensis, and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in the assemblies of loblolly pine. MISSEQUEL can be downloaded at http://www.cs.colostate.edu/seq/.
Submitted 20 November, 2014;
originally announced November 2014.