-
ProtoX: A First Look
Authors:
Het Mankad,
Sanil Rao,
Brian Van Straalen,
Phillip Colella,
Franz Franchetti
Abstract:
We present a first look at ProtoX, a code generation framework for stencil and pointwise operations that occur frequently in the numerical solution of partial differential equations. ProtoX has Proto as its library frontend and SPIRAL as the backend. Proto is a C++-based domain-specific library which optimizes the algorithms used to compute the numerical solution of partial differential equations. Meanwhile, SPIRAL is a code generation system that focuses on generating highly optimized target code. Although the current design layout of Proto and its high level of abstraction provide a user-friendly setup, there is still room for improving its performance by applying various techniques at either the compiler level or the algorithmic level. Hence, in this paper we propose adding SPIRAL as the library backend for Proto, enabling abstraction fusion, which is usually difficult for any compiler to perform. We demonstrate the construction of ProtoX by considering the 2D Poisson equation as a model problem from Proto. We provide the final generated code for CPU, multi-core CPU, and GPU, as well as some performance numbers for CPU.
Submitted 15 July, 2023;
originally announced July 2023.
-
TREBUCHET: Fully Homomorphic Encryption Accelerator for Deep Computation
Authors:
David Bruce Cousins,
Yuriy Polyakov,
Ahmad Al Badawi,
Matthew French,
Andrew Schmidt,
Ajey Jacob,
Benedict Reynwar,
Kellie Canida,
Akhilesh Jaiswal,
Clynn Mathew,
Homer Gamil,
Negar Neda,
Deepraj Soni,
Michail Maniatakos,
Brandon Reagen,
Naifeng Zhang,
Franz Franchetti,
Patrick Brinich,
Jeremy Johnson,
Patrick Broderick,
Mike Franusich,
Bo Zhang,
Zeming Cheng,
Massoud Pedram
Abstract:
Secure computation is of critical importance to not only the DoD, but across financial institutions, healthcare, and anywhere personally identifiable information (PII) is accessed. Traditional security techniques require data to be decrypted before performing any computation. When processed on untrusted systems the decrypted data is vulnerable to attacks to extract the sensitive information. To address these vulnerabilities Fully Homomorphic Encryption (FHE) keeps the data encrypted during computation and secures the results, even in these untrusted environments. However, FHE requires a significant amount of computation to perform equivalent unencrypted operations. To be useful, FHE must significantly close the computation gap (within 10x) to make encrypted processing practical. To accomplish this ambitious goal the TREBUCHET project is leading research and development in FHE processing hardware to accelerate deep computations on encrypted data, as part of the DARPA MTO Data Privacy for Virtual Environments (DPRIVE) program. We accelerate the major secure standardized FHE schemes (BGV, BFV, CKKS, FHEW, etc.) at >=128-bit security while integrating with the open-source PALISADE and OpenFHE libraries currently used in the DoD and in industry. We utilize a novel tile-based chip design with highly parallel ALUs optimized for vectorized 128b modulo arithmetic. The TREBUCHET coprocessor design provides a highly modular, flexible, and extensible FHE accelerator for easy reconfiguration, deployment, integration and application on other hardware form factors, such as System-on-Chip or alternate chip areas.
Submitted 18 April, 2023; v1 submitted 11 April, 2023;
originally announced April 2023.
-
RPU: The Ring Processing Unit
Authors:
Deepraj Soni,
Negar Neda,
Naifeng Zhang,
Benedict Reynwar,
Homer Gamil,
Benjamin Heyman,
Mohammed Nabeel,
Ahmad Al Badawi,
Yuriy Polyakov,
Kellie Canida,
Massoud Pedram,
Michail Maniatakos,
David Bruce Cousins,
Franz Franchetti,
Matthew French,
Andrew Schmidt,
Brandon Reagen
Abstract:
Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have received limited use due to the extreme overheads of running them on general-purpose machines. In this paper, we present a novel vector Instruction Set Architecture (ISA) and microarchitecture for accelerating the ring-based computations of RLWE. The ISA, named B512, is developed to meet the needs of ring processing workloads while balancing high performance and general-purpose programming support. Having an ISA rather than fixed hardware facilitates continued software improvement post-fabrication and the ability to support evolving workloads. We then propose the ring processing unit (RPU), a high-performance, modular implementation of B512. The RPU has native large-word modular arithmetic support, capabilities for very wide parallel processing, and a large-capacity, high-bandwidth scratchpad to meet the needs of ring processing. We address the challenges of programming the RPU using a newly developed SPIRAL backend. A configurable simulator is built to characterize design tradeoffs and quantify performance. The best-performing design was implemented in RTL and used to validate simulator performance. In addition to our characterization, we show that an RPU using 20.5 mm$^2$ of GF 12nm can provide a speedup of 1485x over a CPU running a 64k, 128-bit NTT, a core RLWE workload.
Submitted 13 April, 2023; v1 submitted 29 March, 2023;
originally announced March 2023.
-
Indirect Measurement of Hepatic Drug Clearance by Fitting Dynamical Models
Authors:
Yoko Franchetti,
Thomas D. Nolin,
Franz Franchetti
Abstract:
We present an indirect signal processing-based measurement method for biological quantities in humans that cannot be directly measured. We develop the method by focusing on estimating hepatic enzyme and drug transporter activity through breath-biopsy samples clinically obtained via the erythromycin breath test (EBT): a small dose of radio-labeled drug is injected and the subsequent content of radio-labeled CO$_2$ is measured repeatedly in exhaled breath; the resulting time series is analyzed. To model EBT we developed a 14-variable non-linear reduced order dynamical model that describes the behavior of the drug and its metabolites in the human body well enough to capture all biological phenomena of interest. Based on this system of coupled non-linear ordinary differential equations (ODEs) we treat the measurement problem as an inverse problem: we estimate the ODE parameters of individual patients from the measured EBT time series. These estimates then provide a measurement of the liver activity of interest. The parameters are hard to estimate as the ODEs are stiff and the problem needs to be regularized to ensure stable convergence. We develop a formal operator framework to capture and treat the specific non-linearities present, and perform perturbation analysis to establish properties of the estimation procedure and its solution. Development of the method required 150,000 CPU hours at a supercomputing center, and a single production run takes 24 CPU hours. We introduce and analyze the method in the context of future precision dosing of drugs for vulnerable patients (e.g., oncology, nephrology, or pediatrics) to eventually ensure efficacy and avoid toxicity.
Submitted 31 December, 2020;
originally announced December 2020.
-
A Flexible Framework for Parallel Multi-Dimensional DFTs
Authors:
Doru Thom Popovici,
Martin D. Schatz,
Franz Franchetti,
Tze Meng Low
Abstract:
Multi-dimensional discrete Fourier transforms (DFTs) are typically decomposed into multiple 1D transforms. Hence, parallel implementations of any multi-dimensional DFT focus on parallelizing within or across the 1D DFTs. Existing DFT packages exploit the inherent parallelism across the 1D DFTs and offer rigid frameworks that cannot be extended to incorporate both forms of parallelism and various data layouts to enable some of the parallelism. However, in the era of exascale, where systems have thousands of nodes and intricate network topologies, flexibility and parallel efficiency are key aspects all multi-dimensional DFT frameworks need to have in order to map and scale the computation appropriately. In this work, we present a flexible framework, built on the Redistribution Operations and Tensor Expressions (ROTE) framework, that facilitates the development of a family of parallel multi-dimensional DFT algorithms by 1) unifying the two parallelization schemes within a single framework, 2) exploiting the two different parallelization schemes to different degrees and 3) using different data layouts to distribute the data across the compute nodes. We demonstrate the need for a versatile framework, and thus the need for a family of parallel multi-dimensional DFT algorithms, on the K-Computer, where we show almost linear strong scaling results for problem sizes of 1024^3 on 32k compute nodes.
Submitted 22 December, 2019; v1 submitted 22 April, 2019;
originally announced April 2019.
-
Fast and accurate object detection in high resolution 4K and 8K video using GPUs
Authors:
Vít Růžička,
Franz Franchetti
Abstract:
Machine learning has celebrated many achievements on computer vision tasks such as object detection, but the traditionally used models work with relatively low-resolution images. The resolution of recording devices is gradually increasing and there is a rising need for new methods of processing high-resolution data. We propose an attention pipeline method which uses a two-stage evaluation of each image or video frame under rough and refined resolution to limit the total number of necessary evaluations. For both stages, we make use of the fast object detection model YOLO v2. We have implemented our model in code, which distributes the work across GPUs. We maintain high accuracy while reaching an average performance of 3-6 fps on 4K video and 2 fps on 8K video.
Submitted 24 October, 2018;
originally announced October 2018.
-
High Performance Zero-Memory Overhead Direct Convolutions
Authors:
Jiyuan Zhang,
Franz Franchetti,
Tze Meng Low
Abstract:
The computation of convolution layers in deep neural networks typically relies on high performance routines that trade space for time by using additional memory (either for packing purposes or required as part of the algorithm) to improve performance. The problems with such an approach are two-fold. First, these routines incur additional memory overhead which reduces the overall size of the network that can fit on embedded devices with limited memory capacity. Second, these high performance routines were not optimized for performing convolution, which means that the performance obtained is usually less than conventionally expected. In this paper, we demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead, and yields performance that is between 10% and 400% better than existing high performance implementations of convolution layers on conventional and embedded CPU architectures. We also show that a high performance direct convolution exhibits better scaling performance, i.e., suffers less performance drop, when increasing the number of threads.
Submitted 19 September, 2018;
originally announced September 2018.
-
Automating the Last-Mile for High Performance Dense Linear Algebra
Authors:
Richard Michael Veras,
Tze Meng Low,
Tyler Michael Smith,
Robert van de Geijn,
Franz Franchetti
Abstract:
High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex LAPACK routines. Thus, achieving high performance for the Gemm kernel translates into a high performance linear algebra stack above this kernel. However, it is a monumental task for a domain expert to manually implement the kernel for every library-supported architecture. This leads to the belief that the craft of a Gemm kernel is more dark art than science. It is this premise that drives the popularity of autotuning with code generation in the domain of DLA.
This paper, instead, focuses on an analytical approach to code generation of the Gemm kernel for different architectures, in order to shed light on the details or voodoo required for implementing a high performance Gemm kernel. We distill the implementation of the kernel into an even smaller kernel, an outer-product, and analytically determine how available SIMD instructions can be used to compute the outer-product efficiently. We codify this approach into a system to automatically generate a high performance SIMD implementation of the Gemm kernel. Experimental results demonstrate that our approach yields generated kernels with performance that is competitive with kernels implemented manually or using empirical search.
Submitted 28 April, 2017; v1 submitted 23 November, 2016;
originally announced November 2016.
-
Mathematical Foundations of the GraphBLAS
Authors:
Jeremy Kepner,
Peter Aaltonen,
David Bader,
Aydın Buluc,
Franz Franchetti,
John Gilbert,
Dylan Hutchison,
Manoj Kumar,
Andrew Lumsdaine,
Henning Meyerhenke,
Scott McMillan,
Jose Moreira,
John D. Owens,
Carl Yang,
Marcin Zalewski,
Timothy Mattson
Abstract:
The GraphBLAS standard (GraphBlas.org) is being developed to bring the potential of matrix-based graph algorithms to the broadest possible audience. Mathematically, the GraphBLAS defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the mathematics of the GraphBLAS. Graphs represent connections between vertices with edges. Matrices can represent a wide range of graphs using adjacency matrices or incidence matrices. Adjacency matrices are often easier to analyze while incidence matrices are often better for representing data. Fortunately, the two are easily connected by matrix multiplication. A key feature of matrix mathematics is that a very small number of matrix operations can be used to manipulate a very wide range of graphs. This composability of a small number of operations is the foundation of the GraphBLAS. A standard such as the GraphBLAS can only be effective if it has low performance overhead. Performance measurements of prototype GraphBLAS implementations indicate that the overhead is low.
Submitted 13 July, 2016; v1 submitted 18 June, 2016;
originally announced June 2016.
-
An Information-Theoretic Approach to PMU Placement in Electric Power Systems
Authors:
Qiao Li,
Tao Cui,
Yang Weng,
Rohit Negi,
Franz Franchetti,
Marija D. Ilic
Abstract:
This paper presents an information-theoretic approach to address the phasor measurement unit (PMU) placement problem in electric power systems. Different from the conventional 'topological observability' based approaches, this paper advocates a much more refined, information-theoretic criterion, namely the mutual information (MI) between the PMU measurements and the power system states. The proposed MI criterion not only includes full system observability as a special case, but also rigorously models the remaining uncertainties in the power system states with PMU measurements, so as to generate highly informative PMU configurations. Further, the MI criterion can facilitate robust PMU placement by explicitly modeling probabilistic PMU outages. We propose a greedy PMU placement algorithm, and show that it achieves an approximation ratio of (1-1/e) for any PMU placement budget. We further show that the performance is the best that one can achieve in practice, in the sense that it is NP-hard to achieve any approximation ratio beyond (1-1/e). Such a performance guarantee makes the greedy algorithm very attractive in the practical scenario of multi-stage installations for utilities with limited budgets. Finally, simulation results demonstrate near-optimal performance of the proposed PMU placement algorithm.
Submitted 13 January, 2012;
originally announced January 2012.
-
On-line Decentralized Charging of Plug-In Electric Vehicles in Power Systems
Authors:
Qiao Li,
Tao Cui,
Rohit Negi,
Franz Franchetti,
Marija D. Ilic
Abstract:
Plug-in electric vehicles (PEVs) have gained increasing popularity in recent years, due to the growing societal awareness of the need to reduce greenhouse gas (GHG) emissions and to gain independence from foreign oil or petroleum. Large-scale deployment of PEVs currently faces many challenges. One particular concern is that PEV charging can potentially cause significant impacts on the existing power distribution system, due to the increase in peak load. As such, this work tries to mitigate the impacts of PEV charging by proposing a decentralized smart PEV charging algorithm to minimize the distribution system load variance, so that a 'flat' total load profile can be obtained. The charging algorithm is myopic, in that it controls the PEV charging processes in each time slot based entirely on the current power system states, without knowledge about future system dynamics. We provide theoretical guarantees on the asymptotic optimality of the proposed charging algorithm. Thus, compared to other forecast-based smart charging approaches in the literature, the charging algorithm not only achieves optimality asymptotically in an on-line and decentralized manner, but also is robust against various uncertainties in the power system, such as random PEV driving patterns and distributed generation (DG) with highly intermittent renewable energy sources.
Submitted 18 November, 2011; v1 submitted 24 June, 2011;
originally announced June 2011.