-
Inbetween: Visual Selection in Parametric Design
Authors:
Rony Ginosar,
Amit Zoran
Abstract:
The act of selection plays a leading role in the design process and in the definition of personal style. This work introduces visual selection catalogs into parametric design environments. A two-fold contribution is presented: (i) guidelines for construction of a minimal-bias visual selection catalog from a parametric space, and (ii) Inbetween, a catalog for a parametric typeface that adheres to the guidelines, allows for font selection from a continuous design space, and enables the investigation of personal style. A user study conducted among graphic designers revealed self-coherent characteristics in selection patterns, and a high correlation in selection patterns within tasks. These findings suggest that such patterns reflect personal user styles, formalizing the style selection process as traversals of decision trees. Together, our guidelines and catalog aid in making visual selection a key building block in the digital creation process and validate selection processes as a measure of personal style.
Submitted 4 June, 2022;
originally announced June 2022.
-
WoLFRaM: Enhancing Wear-Leveling and Fault Tolerance in Resistive Memories using Programmable Address Decoders
Authors:
Leonid Yavits,
Lois Orosa,
Suyash Mahar,
João Dinis Ferreira,
Mattan Erez,
Ran Ginosar,
Onur Mutlu
Abstract:
Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform write access patterns. Two main techniques to mitigate endurance-related memory failures are 1) wear-leveling, to evenly distribute the writes across the entire memory, and 2) fault tolerance, to correct memory cell failures. However, one of the main open challenges in extending the lifetime of existing resistive memories is to make both techniques work together seamlessly and efficiently. To address this challenge, we propose WoLFRaM, a new mechanism that combines both wear-leveling and fault tolerance techniques at low cost by using a programmable resistive address decoder (PRAD). The key idea of WoLFRaM is to use PRAD for implementing 1) a new efficient wear-leveling mechanism that remaps write accesses to random physical locations on the fly, and 2) a new efficient fault tolerance mechanism that recovers from faults by remapping failed memory blocks to available physical locations. Our evaluations show that, for a Phase Change Memory (PCM) based system with cell endurance of 10^8 writes, WoLFRaM increases the memory lifetime by 68% compared to a baseline that implements the best state-of-the-art wear-leveling and fault correction mechanisms. WoLFRaM's average / worst-case performance and energy overheads are 0.51% / 3.8% and 0.47% / 2.1% respectively.
Submitted 6 October, 2020;
originally announced October 2020.
-
BioSEAL: In-Memory Biological Sequence Alignment Accelerator for Large-Scale Genomic Data
Authors:
Roman Kaplan,
Leonid Yavits,
Ran Ginosar
Abstract:
Genome sequences contain hundreds of millions of DNA base pairs. Finding the degree of similarity between two genomes requires executing a compute-intensive dynamic programming algorithm, such as Smith-Waterman. Traditional von Neumann architectures have limited parallelism and cannot provide an efficient solution for large-scale genomic data. Approximate heuristic methods (e.g., BLAST) are commonly used. However, they are suboptimal and still compute-intensive. In this work, we present BioSEAL, a Biological SEquence ALignment accelerator. BioSEAL is a massively parallel non-von Neumann processing-in-memory architecture for large-scale DNA and protein sequence alignment. BioSEAL is based on resistive content addressable memory, capable of energy-efficient and high-performance associative processing. We present an associative processing algorithm for entire database sequence alignment on BioSEAL and compare its performance and power consumption with state-of-the-art solutions. We show that BioSEAL can achieve up to 57x speedup and 156x better energy efficiency, compared with existing solutions for genome sequence alignment and protein sequence database search.
Submitted 17 January, 2019;
originally announced January 2019.
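For readers unfamiliar with the Smith-Waterman algorithm named in the BioSEAL abstract, the following is a minimal software sketch of the textbook dynamic-programming recurrence for local alignment scoring. It is not the associative in-memory algorithm of the paper. The function name `smith_waterman_score` and the scoring parameters (`match`, `mismatch`, `gap`, with a linear gap penalty) are illustrative assumptions.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -1) -> int:
    """Return the best local alignment score between sequences a and b.

    Textbook Smith-Waterman with a linear gap penalty. The scoring
    values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in b
            left = H[i][j - 1] + gap    # gap in a
            # Clamping at zero is what makes the alignment local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

The nested loops touch every cell of an (len(a)+1) by (len(b)+1) table, which is the quadratic cost that makes the algorithm compute-intensive on long sequences.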
-
AIDA: Associative DNN Inference Accelerator
Authors:
Leonid Yavits,
Roman Kaplan,
Ran Ginosar
Abstract:
We propose AIDA, an inference engine for accelerating fully-connected (FC) layers of Deep Neural Networks (DNNs). AIDA is an associative in-memory processor, where the bulk of data never leaves the confines of the memory arrays, and processing is performed in-situ. AIDA's area and energy efficiency strongly benefit from sparsity and lower arithmetic precision. We show that AIDA outperforms the state-of-the-art inference accelerator, EIE, by 14.5x (peak performance) and 2.5x (throughput).
Submitted 20 December, 2018;
originally announced January 2019.
-
RASSA: Resistive Pre-Alignment Accelerator for Approximate DNA Long Read Mapping
Authors:
Roman Kaplan,
Leonid Yavits,
Ran Ginosar
Abstract:
DNA read mapping is a computationally expensive bioinformatics task, required for genome assembly and consensus polishing. It requires finding the best-fitting location for each DNA read on a long reference sequence. A novel resistive approximate similarity search accelerator, RASSA, exploits charge distribution and parallel in-memory processing to reflect a mismatch count between DNA sequences. The RASSA implementation of DNA long read pre-alignment outperforms the state-of-the-art solution, minimap2, by 16-77x with comparable accuracy and provides two orders of magnitude higher throughput than GateKeeper, a short-read pre-alignment hardware architecture implemented in FPGA.
Submitted 28 January, 2019; v1 submitted 2 September, 2018;
originally announced September 2018.
-
PRINS: Resistive CAM Processing in Storage
Authors:
Leonid Yavits,
Roman Kaplan,
Ran Ginosar
Abstract:
Near-data in-storage processing research has been gaining momentum in recent years. Typical processing-in-storage architecture places a single or several processing cores inside the storage and allows data processing without transferring it to the host CPU. Since this approach replicates von Neumann architecture inside storage, it is exposed to the problems faced by von Neumann architecture, especially the bandwidth wall. We present PRINS, a novel in-data processing-in-storage architecture based on Resistive Content Addressable Memory (RCAM). PRINS functions simultaneously as a storage and a massively parallel associative processor. PRINS alleviates the bandwidth wall faced by conventional processing-in-storage architectures by keeping the computing inside the storage arrays, thus implementing in-data, rather than near-data, processing. We show that PRINS may outperform a reference computer architecture with a bandwidth-limited external storage. The performance of PRINS Euclidean distance, dot product and histogram implementation exceeds the attainable performance of a reference architecture by up to four orders of magnitude, depending on the dataset size. The performance of PRINS SpMV may exceed the attainable performance of such reference architecture by more than two orders of magnitude.
Submitted 17 March, 2019; v1 submitted 24 May, 2018;
originally announced May 2018.
-
Sparse Matrix Multiplication on CAM Based Accelerator
Authors:
Leonid Yavits,
Ran Ginosar
Abstract:
Sparse matrix multiplication is an important component of linear algebra computations. In this paper, an architecture based on Content Addressable Memory (CAM) and Resistive Content Addressable Memory (ReCAM) is proposed for accelerating sparse matrix by sparse vector and matrix multiplication in CSR format. Using functional simulation, we show that the proposed ReCAM-based accelerator exhibits two orders of magnitude higher power efficiency as compared to existing sparse matrix-vector multiplication implementations.
Submitted 28 May, 2017;
originally announced May 2017.
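As background for the CSR format mentioned in the abstract above, here is a minimal sequential sketch of a sparse matrix by vector product over CSR arrays. It is a plain software baseline and not the CAM/ReCAM architecture the paper proposes. For simplicity it assumes a dense input vector, whereas the abstract refers to sparse vectors. The names `csr_matvec`, `values`, `col_indices`, and `row_ptr` are illustrative.

```python
from typing import List, Sequence


def csr_matvec(values: Sequence[float],
               col_indices: Sequence[int],
               row_ptr: Sequence[int],
               x: Sequence[float]) -> List[float]:
    """Multiply a CSR-encoded sparse matrix by a dense vector x.

    values      -- nonzero entries, stored row by row
    col_indices -- column index of each entry in `values`
    row_ptr     -- row_ptr[r]:row_ptr[r + 1] is the slice of row r
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # Only the stored nonzeros of this row are visited.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_indices[k]]
    return y


if __name__ == "__main__":
    # Encodes the matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]].
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    col_indices = [0, 2, 2, 0, 1]
    row_ptr = [0, 2, 3, 5]
    print(csr_matvec(values, col_indices, row_ptr, [1.0, 2.0, 3.0]))
```

Because the inner loop runs once per stored entry, the work is proportional to the number of nonzero elements rather than to the full matrix size.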
-
Sparse Matrix Multiplication On An Associative Processor
Authors:
L. Yavits,
A. Morad,
R. Ginosar
Abstract:
Sparse matrix multiplication is an important component of linear algebra computations. Implementing sparse matrix multiplication on an associative processor (AP) enables a high level of parallelism, where a row of one matrix is multiplied in parallel with the entire second matrix, and where the execution time of vector dot product does not depend on the vector size. Four sparse matrix multiplication algorithms are explored in this paper, combining AP and baseline CPU processing to various levels. They are evaluated by simulation on a large set of sparse matrices. The computational complexity of sparse matrix multiplication on AP is shown to be O(nnz), where nnz is the number of nonzero elements. The AP is found to be especially efficient in binary sparse matrix multiplication. AP outperforms conventional solutions in power efficiency.
Submitted 20 May, 2017;
originally announced May 2017.
-
Cache Hierarchy Optimization
Authors:
Leonid Yavits,
Amir Morad,
Ran Ginosar
Abstract:
Power consumption, off-chip memory bandwidth, chip area and Network on Chip (NoC) capacity are among the main chip resources limiting the scalability of Chip Multiprocessors (CMP). A closed-form analytical solution for optimizing the CMP cache hierarchy and optimally allocating area among hierarchy levels under such constrained resources is developed. The optimization framework is extended by incorporating the impact of data sharing on cache miss rate. An analytical model for cache access time as a function of cache size is proposed and verified using CACTI simulation.
Submitted 20 May, 2017;
originally announced May 2017.
-
The Effect of Temperature on Amdahl Law in 3D Multicore Era
Authors:
Leonid Yavits,
Amir Morad,
Ran Ginosar
Abstract:
This work studies the influence of temperature on performance and scalability of 3D Chip Multiprocessors (CMP) from Amdahl law perspective. We find that 3D CMP may reach its thermal limit before reaching its maximum power. We show that a high level of parallelism may lead to high peak temperatures even in small scale 3D CMPs, thus limiting 3D CMP scalability and calling for different, in-memory computing architectures.
Submitted 20 May, 2017;
originally announced May 2017.
-
MultiAmdahl: Optimal Resource Allocation in Heterogeneous Architectures
Authors:
Leonid Yavits,
Amir Morad,
Uri Weiser,
Ran Ginosar
Abstract:
Future multiprocessor chips will integrate many different units, each tailored to a specific computation. When designing such a system, the chip architect must decide how to distribute limited system resources such as area, power, and energy among the computational units. We extend MultiAmdahl, an analytical optimization technique for resource allocation in heterogeneous architectures, for energy optimality under a variety of constant system power scenarios. We conclude that reduction in constant system power should be met by reallocating resources from general-purpose computing to heterogeneous accelerator-dominated computing, to keep the overall energy consumption at a minimum. We extend this conclusion to offer an intuition regarding energy-optimal resource allocation in data center computing.
Submitted 19 May, 2017;
originally announced May 2017.
-
A Resistive CAM Processing-in-Storage Architecture for DNA Sequence Alignment
Authors:
Roman Kaplan,
Leonid Yavits,
Ran Ginosar,
Uri Weiser
Abstract:
A novel processing-in-storage (PRinS) architecture based on Resistive CAM (ReCAM) is described and proposed for Smith-Waterman (S-W) sequence alignment. The ReCAM massively-parallel compare operation finds matching base-pairs in a fixed number of cycles, regardless of sequence length. The ReCAM PRinS S-W algorithm is simulated and compared to FPGA, Xeon Phi and GPU-based implementations, showing at least 4.7x higher throughput and at least 15x lower power dissipation.
Submitted 11 June, 2017; v1 submitted 17 January, 2017;
originally announced January 2017.
-
Effect of Data Sharing on Private Cache Design in Chip Multiprocessors
Authors:
Leonid Yavits,
Amir Morad,
Ran Ginosar
Abstract:
In multithreaded applications with a high degree of data sharing, the miss rate of private cache is shown to exhibit a compulsory miss component. It manifests because at least some of the shared data originates from other cores and can only be accessed in a shared cache. The compulsory component does not change with the private cache size, causing its miss rate to diminish more slowly as the cache size grows. As a result, the peak performance of a Chip Multiprocessor (CMP) for workloads with a high degree of data sharing is achieved with a smaller private cache, compared to workloads with no data sharing. The CMP performance can be improved by reassigning some of the constrained area or power resource from private cache to core. Alternatively, the area or power budget of a CMP can be reduced without a performance hit.
Submitted 3 February, 2016;
originally announced February 2016.
-
Convex Optimization of Real Time SoC
Authors:
L. Yavits,
A. Morad,
R. Ginosar,
U. Weiser
Abstract:
Convex optimization methods are employed to optimize a real-time (RT) system-on-chip (SoC) under a variety of physical resource-driven constraints, demonstrated on an industry MPEG2 encoder SoC. The power optimization is compared to a conventional performance-optimization framework, showing a factor of two and a half saving in power. Convex optimization is shown to be very efficient in high-level early-stage design exploration, guiding computer architects as to the choice of area, voltage, and frequency of the individual components of the Chip Multiprocessor (CMP).
Submitted 19 May, 2017; v1 submitted 28 January, 2016;
originally announced January 2016.
-
3D Cache Hierarchy Optimization
Authors:
Leonid Yavits,
Amir Morad,
Ran Ginosar
Abstract:
3D integration has the potential to improve the scalability and performance of Chip Multiprocessors (CMP). A closed-form analytical solution for optimizing 3D CMP cache hierarchy is developed. It allows optimal partitioning of the cache hierarchy levels into 3D silicon layers and optimal allocation of area among cache hierarchy levels under constrained area and power budgets. The optimization framework is extended by incorporating the impact of multithreaded data sharing on the private cache miss rate. An analytical model for cache access time as a function of cache size and the number of 3D partitions is proposed and verified using CACTI simulation.
Submitted 7 November, 2013;
originally announced November 2013.
-
Thermal analysis of 3D associative processor
Authors:
Leonid Yavits,
Amir Morad,
Ran Ginosar
Abstract:
Thermal density and hot spots limit three-dimensional (3D) implementation of massively-parallel SIMD processors and prohibit stacking DRAM dies above them. This study proposes replacing SIMD by an Associative Processor (AP). AP exhibits close to uniform thermal distribution with reduced hot spots. Additionally, AP may outperform SIMD processor when the data set size is sufficiently large, while dissipating less power. Comparative performance and thermal analysis supported by simulation confirm that AP might be preferable over SIMD for 3D implementation of large scale massively parallel processing engines combined with 3D DRAM integration.
Submitted 15 July, 2013;
originally announced July 2013.
-
The Effect of Communication and Synchronization on Amdahl Law in Multicore Systems
Authors:
Leonid Yavits,
Amir Morad,
Ran Ginosar
Abstract:
This work analyses the effects of sequential-to-parallel synchronization and inter-core communication on multicore performance, speedup and scaling. A modification of Amdahl law is formulated, to reflect the finding that parallel speedup is lower than originally predicted, due to these effects. In applications with high inter-core communication requirements, the workload should be executed on a small number of cores, and applications of high sequential-to-parallel synchronization requirements may better be executed by the sequential core, even when f, the Amdahl fraction of parallelization, is very close to 1. To improve the scalability and performance speedup of a multicore, it is as important to address the synchronization and connectivity intensities of parallel algorithms as their parallelization factor.
Submitted 14 June, 2013;
originally announced June 2013.
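As a point of reference for the abstract above, the sketch below computes the classical Amdahl speedup, 1 / ((1 - f) + f / n), for a parallelizable fraction f on n cores. This is the original law only. The modified form that accounts for synchronization and inter-core communication is not stated in the abstract and is not reproduced here. The function name `amdahl_speedup` is an illustrative assumption.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Classical Amdahl speedup for parallel fraction f on n cores.

    This is the unmodified law; it ignores the synchronization and
    communication overheads that the paper argues reduce real speedup.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Sequential part (1 - f) is unchanged; parallel part f is divided by n.
    return 1.0 / ((1.0 - f) + f / n)


if __name__ == "__main__":
    for cores in (2, 16, 256):
        print(cores, round(amdahl_speedup(0.99, cores), 2))
```

Even under the classical law, the sequential fraction caps the achievable speedup at 1 / (1 - f) no matter how many cores are added.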
-
Computer Architecture with Associative Processor Replacing Last Level Cache and SIMD Accelerator
Authors:
Leonid Yavits,
Amir Morad,
Ran Ginosar
Abstract:
This study presents a novel computer architecture where a last level cache and a SIMD accelerator are replaced by an Associative Processor. Associative Processor combines data storage and data processing and provides parallel computational capabilities and data memory at the same time. An analytic performance model of the new computer architecture is introduced. Comparative analysis supported by simulation shows that this novel architecture may outperform a conventional architecture comprising a SIMD coprocessor and a shared last level cache while consuming less power.
Submitted 8 November, 2013; v1 submitted 13 June, 2013;
originally announced June 2013.