-
An Irredundant and Compressed Data Layout to Optimize Bandwidth Utilization of FPGA Accelerators
Authors:
Corentin Ferry,
Nicolas Derumigny,
Steven Derrien,
Sanjay Rajopadhye
Abstract:
Memory bandwidth is known to be a performance bottleneck for FPGA accelerators, especially when they deal with large multi-dimensional data sets. A large body of work focuses on reducing off-chip transfers, but few authors try to improve the efficiency of transfers. This paper addresses the latter issue by proposing (i) a compiler-based approach to the accelerator's data layout that maximizes contiguous access to off-chip memory, and (ii) data packing and runtime compression techniques that take advantage of this layout to further improve memory performance. We show that our approach can decrease I/O cycles by up to $7\times$ compared to unoptimized memory accesses.
Submitted 22 January, 2024;
originally announced January 2024.
-
An Irredundant Decomposition of Data Flow with Affine Dependences
Authors:
Corentin Ferry,
Steven Derrien,
Sanjay Rajopadhye
Abstract:
Optimization pipelines targeting polyhedral programs try to maximize the compute throughput. Traditional approaches favor reuse and temporal locality; while the communicated volume can be low, failure to optimize spatial locality may cause low I/O performance.
Memory allocation schemes using data partitioning such as data tiling can improve the spatial locality, but they are domain-specific and rarely applied by compilers when an existing allocation is supplied.
In this paper, we propose to derive a partitioned memory allocation for tiled polyhedral programs using their data flow information. We extend the existing MARS partitioning to handle affine dependences, and determine which dependences can lead to a regular, simple control flow for communications.
While this paper is a theoretical study, previous work on data partitioning in inter-node scenarios has shown performance improvements due to better bandwidth utilization.
Submitted 6 December, 2023;
originally announced December 2023.
-
Maximal Simplification of Polyhedral Reductions
Authors:
Louis Narmour,
Tomofumi Yuki,
Sanjay Rajopadhye
Abstract:
Reductions combine multiple input values with an associative operator to produce a single (or multiple) result(s). When the same input value contributes to multiple outputs, there is an opportunity to reuse partial results, enabling reduction simplification. Simplification produces a program with lower asymptotic complexity. It is well known that reductions in polyhedral programs may be simplified automatically, but previous methods are incapable of exploiting all available reuse. This paper resolves this long-standing open problem, thereby attaining minimal asymptotic complexity in the simplified program.
Submitted 21 September, 2023;
originally announced September 2023.
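To make the idea of reduction simplification in the abstract above concrete, here is a minimal sketch on a prefix sum, a textbook case where the same input contributes to many outputs. The function names are illustrative assumptions, and this is not the paper's algorithm, only an instance of the reuse it aims to find automatically.

```python
def prefix_sums_naive(x):
    """Y[i] = x[0] + ... + x[i], each output recomputed from scratch.

    Uses O(N^2) additions in total because partial results are not reused.
    """
    return [sum(x[: i + 1]) for i in range(len(x))]


def prefix_sums_simplified(x):
    """Same outputs, but Y[i] is derived from Y[i-1] plus one new input.

    Reusing the previous partial result brings the cost down to O(N) additions.
    """
    y = []
    running = 0
    for value in x:
        running += value
        y.append(running)
    return y
```

Both functions return the same list for any input; only the asymptotic cost differs, which is the effect the abstract calls simplification.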
-
Maximal Atomic irRedundant Sets: a Usage-based Dataflow Partitioning Algorithm
Authors:
Corentin Ferry,
Steven Derrien,
Sanjay Rajopadhye
Abstract:
Programs admitting a polyhedral representation can be transformed in many ways for locality and parallelism, notably loop tiling. Data flow analysis can then compute dependence relations between iterations and between tiles. When tiling is applied, certain iteration-wise dependences cross tile boundaries, creating the need for inter-tile data communication. Previous work computes it as the flow-in and flow-out sets of iteration tiles.
In this paper, we propose a partitioning of the flow-out of a tile into the maximal sets of iterations that are entirely consumed and incur no redundant storage or transfer. The computation is described as an algorithm and performed on a selection of polyhedral programs. We then suggest possible applications of this decomposition in compression and memory allocation.
Submitted 29 November, 2022;
originally announced November 2022.
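One possible reading of the flow-out partitioning described in the abstract above can be sketched as follows. The sketch assumes a 2D iteration space without boundaries, square tiles, and uniform dependence vectors; it groups the flow-out iterations of a tile by the exact set of other tiles that consume them. The names `tile_of` and `partition_flow_out` are hypothetical, and this is a simplified illustration rather than the authors' algorithm.

```python
from collections import defaultdict


def tile_of(point, tile_size):
    """Return the tile coordinates containing an iteration point."""
    return tuple(c // tile_size for c in point)


def partition_flow_out(tile, tile_size, deps):
    """Group the flow-out iterations of `tile` by their set of consumer tiles.

    `deps` lists uniform dependence vectors (di, dj): iteration (i, j)
    produces a value consumed by iteration (i + di, j + dj).
    Iterations consumed only inside `tile` are not part of the flow-out.
    """
    groups = defaultdict(list)
    ti, tj = tile
    for i in range(ti * tile_size, (ti + 1) * tile_size):
        for j in range(tj * tile_size, (tj + 1) * tile_size):
            consumers = frozenset(
                tile_of((i + di, j + dj), tile_size) for di, dj in deps
            ) - {tile}
            if consumers:
                groups[consumers].append((i, j))
    return dict(groups)
```

For a 4x4 tile with dependences (1, 0), (0, 1), and (1, 1), this yields three groups: part of the last row of the tile, part of its last column, and the corner point that is consumed by three neighboring tiles. Each group can then be stored or transferred once for all of its consumers.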
-
Distributed non-negative RESCAL with Automatic Model Selection for Exascale Data
Authors:
Manish Bhattarai,
Namita Kharat,
Erik Skau,
Benjamin Nebgen,
Hristo Djidjev,
Sanjay Rajopadhye,
James P. Smith,
Boian Alexandrov
Abstract:
With the boom in the development of computer hardware and software, social media, IoT platforms, and communications, there has been an exponential growth in the volume of data produced around the world. Among these data, relational datasets are growing in popularity as they provide unique insights regarding the evolution of communities and their interactions. Relational datasets are naturally non-negative, sparse, and extra-large. Relational data usually contain triples, (subject, relation, object), and are represented as graphs/multigraphs, called knowledge graphs, which need to be embedded into a low-dimensional dense vector space. Among various embedding models, RESCAL allows learning of relational data to extract the posterior distributions over the latent variables and to make predictions of missing relations. However, RESCAL is computationally demanding and requires a fast and distributed implementation to analyze extra-large real-world datasets. Here we introduce a distributed non-negative RESCAL algorithm for heterogeneous CPU/GPU architectures with automatic selection of the number of latent communities (model selection), called pyDRESCALk. We demonstrate the correctness of pyDRESCALk with real-world and large synthetic tensors, and its efficacy through near-linear scaling that agrees with the theoretical complexities. Finally, pyDRESCALk determines the number of latent communities in an 11-terabyte dense and 9-exabyte sparse synthetic tensor.
Submitted 18 February, 2022;
originally announced February 2022.
-
Increasing FPGA Accelerators Memory Bandwidth with a Burst-Friendly Memory Layout
Authors:
Corentin Ferry,
Tomofumi Yuki,
Steven Derrien,
Sanjay Rajopadhye
Abstract:
Offloading compute-intensive kernels to hardware accelerators relies on the large degree of parallelism offered by these platforms. However, the effective bandwidth of the memory interface often causes a bottleneck, hindering the accelerator's effective performance. Techniques enabling data reuse, such as tiling, lower the pressure on memory traffic but still often leave the accelerators I/O-bound. A further increase in effective bandwidth is possible by using burst rather than element-wise accesses, provided the data is contiguous in memory.
In this paper, we propose a memory allocation technique, and provide a proof-of-concept source-to-source compiler pass, that enables such burst transfers by modifying the data layout in external memory. We assess how this technique pushes up the memory throughput, leaving room for exploiting additional parallelism, for a minimal logic overhead.
Submitted 23 February, 2022; v1 submitted 11 February, 2022;
originally announced February 2022.
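The benefit of a burst-friendly layout mentioned in the abstract above can be illustrated with a small address-counting sketch. It assumes a row-major 2D array in external memory and models a burst as one maximal run of consecutive addresses; the helper names are hypothetical and the model ignores burst length limits and alignment.

```python
def count_bursts(addresses):
    """Count maximal runs of consecutive addresses in an access sequence."""
    bursts = 0
    previous = None
    for addr in addresses:
        if previous is None or addr != previous + 1:
            bursts += 1
        previous = addr
    return bursts


def row_major_column_addresses(n_rows, n_cols, col):
    """Addresses touched when reading one column of a row-major array."""
    return [r * n_cols + col for r in range(n_rows)]


def packed_column_addresses(n_rows, base):
    """Addresses of the same column after it is copied to a contiguous buffer."""
    return [base + r for r in range(n_rows)]
```

Reading the last column of an 8x8 row-major array takes 8 separate accesses under this model, while the packed copy is read in a single burst. This is the kind of gap a layout transformation tries to close.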
-
On Simplifying Dependent Polyhedral Reductions
Authors:
Sanjay Rajopadhye
Abstract:
\emph{Reductions} combine collections of input values with an associative (and usually also commutative) operator to produce either a single, or a collection of outputs. They are ubiquitous in computing, especially with big data and deep learning. When the \emph{same} input value contributes to multiple output values, there is a tremendous opportunity for reducing (pun intended) the computational effort. This is called \emph{simplification}. \emph{Polyhedral reductions} are reductions where the input and output data collections are (dense) multidimensional arrays (i.e., \emph{tensors}), accessed with linear/affine functions of the indices.
Gautam and Rajopadhye \cite{sanjay-popl06} showed how polyhedral reductions could be simplified automatically (through compile time analysis) and optimally (the resulting program had minimum asymptotic complexity). Yang, Atkinson and Carbin \cite{yang2020simplifying} extended this to the case when (some) input values depend on (some) outputs. Specifically, they showed how the optimal simplification problem could be formulated as a bilinear programming problem, and for the case when the reduction operator admits an inverse, they gave a heuristic solution that retained optimality.
In this note, we show that simplification of dependent reductions can be formulated as a simple extension of the Gautam-Rajopadhye backtracking search algorithm.
Submitted 6 October, 2020;
originally announced October 2020.
-
LLOV: A Fast Static Data-Race Checker for OpenMP Programs
Authors:
Utpal Bora,
Santanu Das,
Pankaj Kukreja,
Saurabh Joshi,
Ramakrishna Upadrasta,
Sanjay Rajopadhye
Abstract:
In the era of Exascale computing, writing efficient parallel programs is indispensable and at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this paper, we propose LLOV, a fast, lightweight, language-agnostic, and static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and F1 score of LLOV are comparable to those of other checkers, while LLOV is orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free.
Submitted 1 September, 2020; v1 submitted 27 December, 2019;
originally announced December 2019.
-
BPPart and BPMax: RNA-RNA Interaction Partition Function and Structure Prediction for the Base Pair Counting Model
Authors:
Ali Ebrahimpour-Boroojeny,
Sanjay Rajopadhye,
Hamidreza Chitsaz
Abstract:
RNA-RNA interaction (RRI) is ubiquitous and has complex roles in cellular functions. In human health studies, miRNA-target and lncRNAs are among an elite class of RRIs that have been extensively studied. Bacterial ncRNA-target and RNA interference are other classes of RRIs that have received significant attention. In recent studies, mRNA-mRNA interaction instances have been observed, where both partners appear in the same pathway without any direct link between them, or any prior knowledge about their relationship. Those recently discovered cases suggest that the scope of RRI is much wider than the aforementioned elite classes.
We revisit our RNA-RNA interaction partition function algorithm, piRNA, which computes the partition function, base-pairing probabilities, and structure for the comprehensive Turner energy model using 96 different dynamic programming tables. In this study, we strategically retreat from sophisticated thermodynamic models to the much simpler base pair counting model. That might seem counter-intuitive at first glance; our idea is to benefit from the advantages of such simple models in terms of running time and memory footprint and compensate for the associated information loss by adding machine learning components in the future.
Here, simple weighted base pair counting is considered to obtain BPPart for Base-pair Partition function and BPMax for Base-pair Maximization, which use 9 and 2 tables respectively. They are empirically 225-fold and 1350-fold faster than piRNA, respectively. A correlation of 0.855 and 0.836 was achieved between piRNA and BPPart and between piRNA and BPMax, respectively, at 37 degrees, and 0.920 and 0.904 at -180 degrees. We also discover two partner RNAs, SNORD3D and TRAF3, and hypothesize their potential roles in genetic diseases. We envision fusion of machine learning methods with the proposed algorithms in the future.
Submitted 7 August, 2020; v1 submitted 2 April, 2019;
originally announced April 2019.
-
Analytical Cost Metrics : Days of Future Past
Authors:
Nirmal Prajapati,
Sanjay Rajopadhye,
Hristo Djidjev
Abstract:
As we move towards the exascale era, the new architectures must be capable of running the massive computational problems efficiently. Scientists and researchers are continuously investing in tuning the performance of extreme-scale computational problems. These problems arise in almost all areas of computing, ranging from big data analytics, artificial intelligence, search, machine learning, virtual/augmented reality, computer vision, image/signal processing to computational science and bioinformatics. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. Therefore, the major challenge that we face in computing systems research is: "how to solve massive-scale computational problems in the most time/power/energy efficient manner?"
Architectures are constantly evolving, making current performance optimization strategies less applicable and requiring new strategies to be invented. The solution is for the new architectures, new programming models, and applications to go forward together. Doing this is, however, extremely hard. There are too many design choices in too many dimensions. We propose the following strategy to solve the problem: (i) Models - Develop accurate analytical models (e.g. execution time, energy, silicon area) to predict the cost of executing a given program, and (ii) Complete System Design - Simultaneously optimize all the cost models for the programs (computational problems) to obtain the most time/area/power/energy efficient solution. Such an optimization problem evokes the notion of codesign.
Submitted 5 February, 2018;
originally announced February 2018.
-
PCOT: Cache Oblivious Tiling of Polyhedral Programs
Authors:
Waruna Ranasinghe,
Nirmal Prajapati,
Tomofumi Yuki,
Sanjay Rajopadhye
Abstract:
This paper studies two variants of tiling: iteration space tiling (or loop blocking) and cache-oblivious methods that recursively split the iteration space with divide-and-conquer. The key question to answer is when we should be using one over the other. The answer to this question is complicated for modern architectures for a number of reasons. In this paper, we present a detailed empirical study to answer this question for a range of kernels that fit the polyhedral model. Our study is based on a generalized cache-oblivious code generator that supports this class, which is a superset of those supported by existing tools. The conclusion is that cache-oblivious code is most useful when the aim is to reduce off-chip memory accesses, e.g., for lower energy, although certain situations diminish its effectiveness.
Submitted 1 February, 2018;
originally announced February 2018.
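The two tiling variants contrasted in the abstract above can be sketched on a matrix transpose, chosen here only because it is short and has no dependences. The function names, the `tile` parameter, and the `base` cutoff are assumptions for illustration, not the output of the PCOT generator.

```python
def transpose_tiled(a, n, tile):
    """Iteration space tiling: fixed-size blocks chosen ahead of time."""
    b = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    b[j][i] = a[i][j]
    return b


def transpose_cache_oblivious(a, n, base=2):
    """Cache-oblivious variant: recursively halve the longer dimension.

    No tile size tied to a cache level is chosen; `base` only bounds the
    recursion depth so that leaf blocks are handled by plain loops.
    """
    b = [[0] * n for _ in range(n)]

    def recurse(i0, i1, j0, j1):
        if i1 - i0 <= base and j1 - j0 <= base:
            for i in range(i0, i1):
                for j in range(j0, j1):
                    b[j][i] = a[i][j]
        elif i1 - i0 >= j1 - j0:
            mid = (i0 + i1) // 2
            recurse(i0, mid, j0, j1)
            recurse(mid, i1, j0, j1)
        else:
            mid = (j0 + j1) // 2
            recurse(i0, i1, j0, mid)
            recurse(i0, i1, mid, j1)

    recurse(0, n, 0, n)
    return b
```

Both functions compute the same result and differ only in the order in which iterations are visited, which is what determines their memory behavior on a real machine.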
-
Accelerator Codesign as Non-Linear Optimization
Authors:
Nirmal Prajapati,
Sanjay Rajopadhye,
Hristo Djidjev,
Nandkishore Santhi,
Tobias Grosser,
Rumen Andonov
Abstract:
We propose an optimization approach for determining both hardware and software parameters for the efficient implementation of a (family of) applications called dense stencil computations on programmable GPGPUs. We first introduce a simple, analytical model for the silicon area usage of accelerator architectures and a workload characterization of stencil computations. We combine this characterization with a parametric execution time model and formulate a mathematical optimization problem. That problem seeks to maximize a common objective function of 'all the hardware and software parameters'. The solution to this problem therefore "solves" the codesign problem: simultaneously choosing software-hardware parameters to optimize total performance.
We validate this approach by proposing architectural variants of the NVIDIA Maxwell GTX-980 (respectively, Titan X) specifically tuned to a predetermined workload of four common 2D stencils (Heat, Jacobi, Laplacian, and Gradient) and two 3D ones (Heat and Laplacian). Our model predicts that performance would potentially improve by 28% (respectively, 33%) with simple tweaks to the hardware parameters such as adapting coarse and fine-grained parallelism by changing the number of streaming multiprocessors and the number of compute cores each contains. We propose a set of Pareto-optimal design points to exploit the trade-off between performance and silicon area and show that by additionally eliminating GPU caches, we can get a further 2-fold improvement.
Submitted 13 December, 2017;
originally announced December 2017.
-
Hybrid Static/Dynamic Schedules for Tiled Polyhedral Programs
Authors:
Tian **,
Nirmal Prajapati,
Waruna Ranasinghe,
Guillaume Iooss,
Yun Zou,
Sanjay Rajopadhye,
David Wonnacott
Abstract:
Polyhedral compilers perform optimizations such as tiling and parallelization; when doing both, they usually generate code that executes "barrier-synchronized wavefronts" of tiles. We present a system to express and generate code for hybrid schedules, where some constraints are automatically satisfied through the structure of the code, and the remainder are dynamically enforced at run-time with data flow mechanisms. We prove bounds on the added overheads that are better, by at least one polynomial degree, than those of previous techniques.
We propose a generic mechanism to implement the needed synchronization, and show it can be easily realized for a variety of targets: OpenMP, Pthreads, GPU (CUDA or OpenCL) code, languages like X10, Habanero, and Cilk, data flow platforms like DAGuE and OpenStream, and MPI. We also provide a simple concrete implementation that works without the need of any sophisticated run-time mechanism.
Our experiments show our simple implementation to be competitive with or better than the wavefront-synchronized code generated by other systems. We also show how the proposed mechanism can achieve a 24% to 70% reduction in energy.
Submitted 23 October, 2016;
originally announced October 2016.
-
Checking Race Freedom of Clocked X10 Programs
Authors:
Tomofumi Yuki,
Paul Feautrier,
Sanjay Rajopadhye,
Vijay Saraswat
Abstract:
One of many approaches to better take advantage of parallelism, which has now become mainstream, is the introduction of parallel programming languages. However, parallelism is by nature non-deterministic, and not all parallel bugs can be avoided by language design. This paper proposes a method for guaranteeing absence of data races in the polyhedral subset of clocked X10 programs.
Clocks in X10 are similar to barriers, but are more dynamic; the subset of processes that participate in the synchronization can dynamically change at runtime. We construct the happens-before relation for clocked X10 programs, and show that the problem of race detection is undecidable. However, in many practical cases, modern tools are able to find solutions or disprove their existence. We present a set of benchmarks for which the analysis is possible and has an acceptable running time.
Submitted 18 November, 2013;
originally announced November 2013.