-
High-Performance Simultaneous Multiprocessing for Heterogeneous System-on-Chip
Authors:
Kris Nikov,
Mohammad Hosseinabady,
Rafael Asenjo,
Andrés Rodríguezz,
Angeles Navarro,
Jose Nunez-Yanez
Abstract:
This paper presents a methodology for simultaneous heterogeneous computing, named ENEAC, where a quad core ARM Cortex-A53 CPU works in tandem with a preprogrammed on-board FPGA accelerator. A heterogeneous scheduler distributes the tasks optimally among all the resources and all compute units run asynchronously, which allows for improved performance for irregular workloads. ENEAC achieves up to 17…
▽ More
This paper presents a methodology for simultaneous heterogeneous computing, named ENEAC, where a quad core ARM Cortex-A53 CPU works in tandem with a preprogrammed on-board FPGA accelerator. A heterogeneous scheduler distributes the tasks optimally among all the resources and all compute units run asynchronously, which allows for improved performance for irregular workloads. ENEAC achieves up to 17\% performance improvement \ignore{and 14\% energy usage reduction,} when using all platform resources compared to just using the FPGA accelerators and up to 865\% performance increase \ignore{and up to 89\% energy usage decrease} when using just the CPU. The workflow uses existing commercial tools and C/C++ as a single programming language for both accelerator design and CPU programming for improved productivity and ease of verification.
△ Less
Submitted 20 August, 2020;
originally announced August 2020.
-
Run-Time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling
Authors:
Jose Nunez-Yanez,
Kris Nikov,
Kerstin Eder,
Mohammad Hosseinabady
Abstract:
This paper investigates the application of a robust CPU-based power modelling methodology that performs an automatic search of explanatory events derived from performance counters to embedded GPUs. A 64-bit Tegra TX1 SoC is configured with DVFS enabled and multiple CUDA benchmarks are used to train and test models optimized for each frequency and voltage point. These optimized models are then comp…
▽ More
This paper investigates the application of a robust CPU-based power modelling methodology that performs an automatic search of explanatory events derived from performance counters to embedded GPUs. A 64-bit Tegra TX1 SoC is configured with DVFS enabled and multiple CUDA benchmarks are used to train and test models optimized for each frequency and voltage point. These optimized models are then compared with a simpler unified model that uses a single set of model coefficients for all frequency and voltage points of interest. To obtain this unified model, a number of experiments are conducted to extract information on idle, clock and static power to derive power usage from a single reference equation. The results show that the unified model offers competitive accuracy with an average 5\% error with four explanatory variables on the test data set and it is capable to correctly predict the impact of voltage, frequency and temperature on power consumption. This model could be used to replace direct power measurements when these are not available due to hardware limitations or worst-case analysis in emulation platforms.
△ Less
Submitted 18 June, 2020;
originally announced June 2020.
-
Heterogeneous FPGA+GPU Embedded Systems: Challenges and Opportunities
Authors:
Mohammad Hosseinabady,
Mohd Amiruddin Bin Zainol,
Jose Nunez-Yanez
Abstract:
The edge computing paradigm has emerged to handle cloud computing issues such as scalability, security and low response time among others. This new computing trend heavily relies on ubiquitous embedded systems on the edge. Performance and energy consumption are two main factors that should be considered during the design of such systems. Focusing on performance and energy consumption, this paper s…
▽ More
The edge computing paradigm has emerged to handle cloud computing issues such as scalability, security and low response time among others. This new computing trend heavily relies on ubiquitous embedded systems on the edge. Performance and energy consumption are two main factors that should be considered during the design of such systems. Focusing on performance and energy consumption, this paper studies the opportunities and challenges that a heterogeneous embedded system consisting of embedded FPGAs and GPUs (as accelerators) can provide for applications. We study three design, modeling and scheduling challenges throughout the paper. We also propose three techniques to cope with these three challenges. Applying the proposed techniques to three applications including image histogram, dense matrix-vector multiplication and sparse matrix-vector multiplications show 1.79x and 2.29x improvements in performance and energy consumption, respectively, when both FPGA and GPU execute the corresponding application in parallel.
△ Less
Submitted 25 January, 2019; v1 submitted 18 January, 2019;
originally announced January 2019.
-
Parallelizing Workload Execution in Embedded and High-Performance Heterogeneous Systems
Authors:
Jose Nunez-Yanez,
Mohammad Hosseinabady,
Moslem Amiri,
Andrés Rodríguez,
Rafael Asenjo,
Angeles Navarro,
Rubén Gran-Tejero,
Darío Suárez-Gracia
Abstract:
In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous system-on-chip (SoC) including FPGA-based hardware accelerators and programmable CPUs. Two platforms with different architectures are considered, and a single C/C++ source code is used in both of them for the CPU and FPGA resources.…
▽ More
In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous system-on-chip (SoC) including FPGA-based hardware accelerators and programmable CPUs. Two platforms with different architectures are considered, and a single C/C++ source code is used in both of them for the CPU and FPGA resources. Instead of simply using the hardware accelerator to offload a task from the CPU, we propose a scheduler that dynamically distributes the tasks among all the resources to fully exploit all computing devices while minimizing load unbalance. The multi-architecture study compares an ARMV7 and ARMV8 implementation with different number and type of CPU cores and also different FPGA micro-architecture and size. We measure that both platforms benefit from having the CPU cores assist FPGA execution at the same level of energy requirements.
△ Less
Submitted 9 February, 2018;
originally announced February 2018.
-
Simultaneous Reduction of Dynamic and Static Power in Scan Structures
Authors:
Shervin Sharifi,
Javid Jaffari,
Mohammad Hosseinabady,
Ali Afzali-Kusha,
Zainalabedin Navabi
Abstract:
Power dissipation during test is a major challenge in testing integrated circuits. Dynamic power has been the dominant part of power dissipation in CMOS circuits, however, in future technologies the static portion of power dissipation will outreach the dynamic portion. This paper proposes an efficient technique to reduce both dynamic and static power dissipation in scan structures. Scan cell out…
▽ More
Power dissipation during test is a major challenge in testing integrated circuits. Dynamic power has been the dominant part of power dissipation in CMOS circuits, however, in future technologies the static portion of power dissipation will outreach the dynamic portion. This paper proposes an efficient technique to reduce both dynamic and static power dissipation in scan structures. Scan cell outputs which are not on the critical path(s) are multiplexed to fixed values during scan mode. These constant values and primary inputs are selected such that the transitions occurred on non-multiplexed scan cells are suppressed and the leakage current during scan mode is decreased. A method for finding these vectors is also proposed. Effectiveness of this technique is proved by experiments performed on ISCAS89 benchmark circuits.
△ Less
Submitted 25 October, 2007;
originally announced October 2007.