Search | arXiv e-print repository

SME: A High Productivity FPGA Tool for Software Programmers

Authors: Carl-Johannes Johnsen, Alberte Thegler, Kenneth Skovhede, Brian Vinter

Abstract: For several decades, the CPU has been the standard model to use in the majority of computing. While the CPU does excel in some areas, heterogeneous computing, such as reconfigurable hardware, is showing increasing potential in areas like parallelization, performance, and power usage. This is especially prominent in problems favoring deep pipelining or tight latency requirements. However, due to th… ▽ More For several decades, the CPU has been the standard model to use in the majority of computing. While the CPU does excel in some areas, heterogeneous computing, such as reconfigurable hardware, is showing increasing potential in areas like parallelization, performance, and power usage. This is especially prominent in problems favoring deep pipelining or tight latency requirements. However, due to the nature of these problems, they can be hard to program, at least for software developers. Synchronous Message Exchange (SME) is a runtime environment that allows development, testing and verification of hardware designs for FPGA devices in C#, with access to modern debugging and code features. The goal is to create a framework for software developers to easily implement systems for FPGA devices without having to obtain heavy hardware programming knowledge. This article presents a short introduction to the SME model as well as new updates to SME. Lastly, a selection of student projects and examples will be presented in order to show how it is possible to create quite complex structures in SME, even by students with no hardware experience. △ Less

Submitted 20 April, 2021; originally announced April 2021.

arXiv:1504.06474 [pdf, ps, other]

doi 10.1016/j.parco.2015.04.004

Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors

Authors: Weifeng Liu, Brian Vinter

Abstract: Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cor… ▽ More Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR △ Less

Submitted 14 September, 2015; v1 submitted 24 April, 2015; originally announced April 2015.

Comments: 22 pages, 8 figures, Published at Parallel Computing (PARCO)

MSC Class: 65F50 ACM Class: G.4; G.1.3

arXiv:1504.05022 [pdf, ps, other]

doi 10.1016/j.jpdc.2015.06.010

A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors

Authors: Weifeng Liu, Brian Vinter

Abstract: General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matr… ▽ More General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the resulting sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. In this work we propose a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors. This framework particularly focuses on the above three problems. Memory pre-allocation for the resulting matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers excellent absolute performance and relative speedups on various benchmarks multiplying matrices with diverse sparsity structures. Furthermore, on heterogeneous processors, our SpGEMM approach achieves higher throughput by using re-allocatable shared virtual memory. The source code of this work is available at https://github.com/bhSPARSE/Benchmark_SpGEMM_using_CSR △ Less

Submitted 13 September, 2015; v1 submitted 20 April, 2015; originally announced April 2015.

Comments: 25 pages, 12 figures, published at Journal of Parallel and Distributed Computing (JPDC)

MSC Class: 65F50 ACM Class: G.4; G.1.3

arXiv:1503.05032 [pdf, ps, other]

CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication

Authors: Weifeng Liu, Brian Vinter

Abstract: Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV… ▽ More Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage format, which offers high-throughput SpMV on various platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is insensitive to the sparsity structure of the input matrix. Thus the single format can support an SpMV algorithm that is efficient both for regular matrices and for irregular matrices. Furthermore, we show that the overhead of the format conversion from the CSR to the CSR5 can be as low as the cost of a few SpMV operations. We compare the CSR5-based SpMV algorithm with 11 state-of-the-art formats and algorithms on four mainstream processors using 14 regular and 10 irregular matrices as a benchmark suite. For the 14 regular matrices in the suite, we achieve comparable or better performance over the previous work. For the 10 irregular matrices, the CSR5 obtains average performance improvement of 17.6\%, 28.5\%, 173.0\% and 293.3\% (up to 213.3\%, 153.6\%, 405.1\% and 943.3\%) over the best existing work on dual-socket Intel CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For real-world applications such as a solver with only tens of iterations, the CSR5 format can be more practical because of its low-overhead for format conversion. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5 △ Less

Submitted 9 April, 2015; v1 submitted 17 March, 2015; originally announced March 2015.

Comments: 12 pages, 10 figures, In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15)

MSC Class: 65F50 ACM Class: G.4; G.1.3

arXiv:1210.7774 [pdf, other]

cphVB: A System for Automated Runtime Optimization and Parallelization of Vectorized Applications

Authors: Mads Ruben Burgdorff Kristensen, Simon Andreas Frimann Lund, Troels Blum, Brian Vinter

Abstract: Modern processor architectures, in addition to having still more cores, also require still more consideration to memory-layout in order to run at full capacity. The usefulness of most languages is deprecating as their abstractions, structures or objects are hard to map onto modern processor architectures efficiently. The work in this paper introduces a new abstract machine framework, cphVB, that… ▽ More Modern processor architectures, in addition to having still more cores, also require still more consideration to memory-layout in order to run at full capacity. The usefulness of most languages is deprecating as their abstractions, structures or objects are hard to map onto modern processor architectures efficiently. The work in this paper introduces a new abstract machine framework, cphVB, that enables vector oriented high-level programming languages to map onto a broad range of architectures efficiently. The idea is to close the gap between high-level languages and hardware optimized low-level implementations. By translating high-level vector operations into an intermediate vector bytecode, cphVB enables specialized vector engines to efficiently execute the vector operations. The primary success parameters are to maintain a complete abstraction from low-level details and to provide efficient code execution across different, modern, processors. We evaluate the presented design through a setup that targets multi-core CPU architectures. We evaluate the performance of the implementation using Python implementations of well-known algorithms: a jacobi solver, a kNN search, a shallow water simulation and a synthetic stencil simulation. All demonstrate good performance. △ Less

Submitted 25 March, 2013; v1 submitted 26 October, 2012; originally announced October 2012.

Journal ref: Proceedings of The 11th Python In Science Conference (SciPy 2012)

arXiv:1201.3804 [pdf, ps, other]

doi 10.1109/HPCC.2012.80

Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

Authors: Mads Ruben Burgdorff Kristensen, Brian Vinter

Abstract: This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages, it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between sc… ▽ More This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages, it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks to allow maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed memory architectures. Furthermore, we present performance comparisons for eight benchmarks with and without automatic latency-hiding. The results shows that our model reduces the time spent on waiting for communication as much as 27 times, from a maximum of 54% to only 2% of the total execution time, in a stencil application. △ Less

Submitted 18 January, 2012; originally announced January 2012.

Comments: PREPRINT

Journal ref: Proceeding HPCC '12 Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems

Showing 1–6 of 6 results for author: Vinter, B