-
EnergyAnalyzer: Using Static WCET Analysis Techniques to Estimate the Energy Consumption of Embedded Applications
Authors:
Simon Wegener,
Kris K. Nikov,
Jose Nunez-Yanez,
Kerstin Eder
Abstract:
This paper presents EnergyAnalyzer, a code-level static analysis tool for estimating the energy consumption of embedded software based on statically predictable hardware events. The tool utilises techniques usually used for worst-case execution time (WCET) analysis together with bespoke energy models developed for two predictable architectures - the ARM Cortex-M0 and the Gaisler LEON3 - to perform energy usage analysis. EnergyAnalyzer has been applied in various use cases, such as selecting candidates for an optimised convolutional neural network, analysing the energy consumption of a camera pill prototype, and analysing the energy consumption of satellite communications software. The tool was developed as part of a larger project called TeamPlay, which aimed to provide a toolchain for developing embedded applications where energy properties are first-class citizens, allowing the developer to reflect directly on these properties at the source code level. The analysis capabilities of EnergyAnalyzer are validated across a large number of benchmarks for the two target architectures and the results show that the statically estimated energy consumption has, with a few exceptions, less than 1% difference compared to the underlying empirical energy models which have been validated on real hardware.
Submitted 25 May, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Big-Little Adaptive Neural Networks on Low-Power Near-Subthreshold Processors
Authors:
Zichao Shen,
Neil Howard,
Jose Nunez-Yanez
Abstract:
This paper investigates the energy savings that near-subthreshold processors can obtain in edge AI applications and proposes strategies to improve them while maintaining the accuracy of the application. The selected processors deploy adaptive voltage scaling techniques in which the frequency and voltage levels of the processor core are determined at run time. In these systems, embedded RAM and flash memory size is typically limited to less than 1 megabyte to save power. This limited memory imposes restrictions on the complexity of the neural network models that can be mapped to these devices and on the required trade-offs between accuracy and battery life. To address these issues, we propose and evaluate alternative 'big-little' neural network strategies to improve battery life while maintaining prediction accuracy. The strategies are applied to a human activity recognition application selected as a demonstrator, which shows that, compared to the original network, the best configurations obtain a measured energy reduction of 80% while maintaining the original level of inference accuracy.
Submitted 19 April, 2023;
originally announced April 2023.
-
Dynamically Reconfigurable Variable-precision Sparse-Dense Matrix Acceleration in Tensorflow Lite
Authors:
Jose Nunez-Yanez,
Andres Otero,
Eduardo de la Torre
Abstract:
In this paper, we present a dynamically reconfigurable hardware accelerator called FADES (Fused Architecture for DEnse and Sparse matrices). The FADES design offers multiple configuration options that trade off parallelism and complexity using a dataflow model to create four stages that read, compute, scale and write results. FADES is mapped to the programmable logic (PL) and integrated with the TensorFlow Lite inference engine running on the processing system (PS) of a heterogeneous SoC device. The accelerator is used to compute the tensor operations, while the dynamically reconfigurable approach can be used to switch precision between int8 and float modes. This dynamic reconfiguration enables better performance by allowing more cores to be mapped to the resource-constrained device and lower power consumption compared with supporting both arithmetic precisions simultaneously. We compare the proposed hardware with a high-performance systolic architecture for dense matrices, obtaining 25% better performance in dense mode with half the DSP blocks in the same technology. In sparse mode, we show that the core can outperform dense mode even at low sparsity levels, and a single core achieves up to 20x acceleration over the software-optimized NEON RUY library.
Submitted 17 April, 2023;
originally announced April 2023.
-
Accurate Energy Modelling on the Cortex-M0 Processor for Profiling and Static Analysis
Authors:
Kris Nikov,
Kyriakos Georgiou,
Zbigniew Chamski,
Kerstin Eder,
Jose Nunez-Yanez
Abstract:
Energy modelling can enable energy-aware software development and assist the developer in meeting an application's energy budget. Although many energy models for embedded processors exist, most do not account for processor-specific configurations, nor are they suitable for static energy consumption estimation. This paper introduces a set of comprehensive energy models for Arm's Cortex-M0 processor, ready to support energy-aware development of edge computing applications using either profiling- or static-analysis-based energy consumption estimation. We use a commercially representative physical platform together with a custom modified Instruction Set Simulator to obtain the physical data and system state markers used to generate the models. The models account for different processor configurations, all of which have a significant impact on the execution time and energy consumption of edge computing applications. Unlike existing works, which target a very limited set of applications, all developed models are generated and validated using a very wide range of benchmarks from a variety of emerging IoT application areas, including machine learning, and have a prediction error of less than 5%.
Submitted 30 January, 2023;
originally announced January 2023.
-
Robust and accurate fine-grain power models for embedded systems with no on-chip PMU
Authors:
Kris Nikov,
Marcos Martinez,
Simon Wegener,
Jose Nunez-Yanez,
Zbigniew Chamski,
Kyriakos Georgiou,
Kerstin Eder
Abstract:
This paper presents a novel approach to event-based power modelling for embedded platforms that do not have a Performance Monitoring Unit (PMU). The method involves complementing the target hardware platform, where the physical power data is measured, with another platform on which the CPU performance data needed for model generation can be collected. The methodology is used to generate accurate fine-grain power models for the Gaisler GR712RC dual-core LEON3 fault-tolerant SPARC processor with on-board power sensors and no PMU. A Kintex UltraScale FPGA is used as the support platform to obtain the required CPU performance data, by running a soft-core representation of the dual-core LEON3 as on the GR712RC but with a PMU implementation. Both platforms execute the same benchmark set and data collection is synchronised using per-sample timestamps so that the power sensor data from the GR712RC board can be matched to the PMU data from the FPGA. The synchronised samples are then processed by the Robust Energy and Power Predictor Selection (REPPS) software in order to generate power models. The models achieve less than 2% power estimation error when validated on an industrial use-case and can successfully follow program phases, which makes them suitable for runtime power profiling.
Submitted 9 November, 2021; v1 submitted 26 May, 2021;
originally announced June 2021.
-
Evaluation of hybrid run-time power models for the ARM big.LITTLE architecture
Authors:
Kris Nikov,
Jose L. Nunez-Yanez,
Matthew Horsnell
Abstract:
Heterogeneous processors, formed by binary compatible CPU cores with different microarchitectures, enable energy reductions by better matching processing capabilities and software application requirements. This new hardware platform requires novel techniques to manage power and energy to fully utilize its capabilities, particularly regarding the mapping of workloads to appropriate cores. In this paper we validate relevant published work related to power modelling for heterogeneous systems and propose a new approach for developing run-time power models that uses a hybrid set of physical predictors, performance events and CPU state information. We demonstrate the accuracy of this approach compared with the state-of-the-art and its applicability to energy aware scheduling. Our results are obtained on a commercially available platform built around the Samsung Exynos 5 Octa SoC, which features the ARM big.LITTLE heterogeneous architecture.
Submitted 24 August, 2020;
originally announced August 2020.
-
High-Performance Simultaneous Multiprocessing for Heterogeneous System-on-Chip
Authors:
Kris Nikov,
Mohammad Hosseinabady,
Rafael Asenjo,
Andrés Rodríguez,
Angeles Navarro,
Jose Nunez-Yanez
Abstract:
This paper presents a methodology for simultaneous heterogeneous computing, named ENEAC, where a quad-core ARM Cortex-A53 CPU works in tandem with a preprogrammed on-board FPGA accelerator. A heterogeneous scheduler distributes the tasks optimally among all the resources and all compute units run asynchronously, which allows for improved performance for irregular workloads. ENEAC achieves up to 17% performance improvement when using all platform resources compared to just using the FPGA accelerators and up to 865% performance increase compared to using just the CPU. The workflow uses existing commercial tools and C/C++ as a single programming language for both accelerator design and CPU programming for improved productivity and ease of verification.
Submitted 20 August, 2020;
originally announced August 2020.
-
Run-Time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling
Authors:
Jose Nunez-Yanez,
Kris Nikov,
Kerstin Eder,
Mohammad Hosseinabady
Abstract:
This paper investigates the application of a robust CPU-based power modelling methodology that performs an automatic search of explanatory events derived from performance counters to embedded GPUs. A 64-bit Tegra TX1 SoC is configured with DVFS enabled and multiple CUDA benchmarks are used to train and test models optimized for each frequency and voltage point. These optimized models are then compared with a simpler unified model that uses a single set of model coefficients for all frequency and voltage points of interest. To obtain this unified model, a number of experiments are conducted to extract information on idle, clock and static power to derive power usage from a single reference equation. The results show that the unified model offers competitive accuracy with an average 5% error with four explanatory variables on the test data set and that it is capable of correctly predicting the impact of voltage, frequency and temperature on power consumption. This model could be used to replace direct power measurements when these are not available due to hardware limitations or worst-case analysis in emulation platforms.
Submitted 18 June, 2020;
originally announced June 2020.
-
Relationship Estimation Metrics for Binary SoC Data
Authors:
Dave McEwan,
Jose Nunez-Yanez
Abstract:
System-on-Chip (SoC) designs are used in every aspect of computing and their optimization is a difficult but essential task in today's competitive market. Data taken from SoCs to achieve this is often characterised by very long concurrent bit vectors which have unknown relationships to each other. This paper explains and empirically compares the accuracy of several methods used to detect the existence of these relationships in a wide range of systems. A probabilistic model is used to construct and test a large number of SoC-like systems with known relationships which are compared with the estimated relationships to give accuracy scores. The metrics Ċov and Dep based on covariance and independence are demonstrated to be the most useful, whereas metrics based on the Hamming distance and geometric approaches are shown to be less useful for detecting the presence of relationships between SoC data.
Submitted 24 September, 2019; v1 submitted 10 May, 2019;
originally announced May 2019.
-
High-Performance Ultrasonic Levitation with FPGA-based Phased Arrays
Authors:
William Beasley,
Brenda Gatusch,
Daniel Connolly-Taylor,
Chenyuan Teng,
Asier Marzo,
Jose Nunez-Yanez
Abstract:
We present a flexible and self-contained platform for acoustic levitation research based on the Xilinx Zynq SoC using an array of ultrasonic emitters. The platform employs an inexpensive ZedBoard and provides fast movement of the levitated objects as well as object detection based on the produced echo. Several features available in the Zynq device are of benefit for this platform: hardware acceleration for the phase calculations, a large number of parallel I/Os connected through the FPGA Mezzanine connector (FMC), integrated ADC capabilities to capture echo signals, and ease of programmability due to a C-based design flow for both CPU and FPGA. Planar and spherical cap phased arrays are created, and we investigate the capabilities and limitations of the different designs to improve the stability of the levitation process.
Submitted 24 January, 2019; v1 submitted 18 January, 2019;
originally announced January 2019.
-
Heterogeneous FPGA+GPU Embedded Systems: Challenges and Opportunities
Authors:
Mohammad Hosseinabady,
Mohd Amiruddin Bin Zainol,
Jose Nunez-Yanez
Abstract:
The edge computing paradigm has emerged to handle cloud computing issues such as scalability, security and low response time among others. This new computing trend heavily relies on ubiquitous embedded systems on the edge. Performance and energy consumption are two main factors that should be considered during the design of such systems. Focusing on performance and energy consumption, this paper studies the opportunities and challenges that a heterogeneous embedded system consisting of embedded FPGAs and GPUs (as accelerators) can provide for applications. We study three design, modeling and scheduling challenges throughout the paper. We also propose three techniques to cope with these three challenges. Applying the proposed techniques to three applications, namely image histogram, dense matrix-vector multiplication and sparse matrix-vector multiplication, shows 1.79x and 2.29x improvements in performance and energy consumption, respectively, when both FPGA and GPU execute the corresponding application in parallel.
Submitted 25 January, 2019; v1 submitted 18 January, 2019;
originally announced January 2019.
-
Parallelizing Workload Execution in Embedded and High-Performance Heterogeneous Systems
Authors:
Jose Nunez-Yanez,
Mohammad Hosseinabady,
Moslem Amiri,
Andrés Rodríguez,
Rafael Asenjo,
Angeles Navarro,
Rubén Gran-Tejero,
Darío Suárez-Gracia
Abstract:
In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in heterogeneous systems-on-chip (SoCs), including FPGA-based hardware accelerators and programmable CPUs. Two platforms with different architectures are considered, and a single C/C++ source code is used in both of them for the CPU and FPGA resources. Instead of simply using the hardware accelerator to offload a task from the CPU, we propose a scheduler that dynamically distributes the tasks among all the resources to fully exploit all computing devices while minimizing load imbalance. The multi-architecture study compares an ARMv7 and an ARMv8 implementation with different numbers and types of CPU cores and also different FPGA micro-architectures and sizes. We measure that both platforms benefit from having the CPU cores assist FPGA execution at the same level of energy requirements.
Submitted 9 February, 2018;
originally announced February 2018.
-
Energy Efficient Video Fusion with Heterogeneous CPU-FPGA Devices
Authors:
Jose Nunez-Yanez,
Tom Sun
Abstract:
This paper presents a complete video fusion system with hardware acceleration and investigates the energy trade-offs between computing in the CPU or the FPGA device. The video fusion application is based on the Dual-Tree Complex Wavelet Transform (DT-CWT). In this work the transforms are mapped to a hardware accelerator using high-level synthesis tools for the FPGA and also to vectorized code for the single instruction multiple data (SIMD) engine available in the CPU. The accelerated system reduces computation time and energy by a factor of 2. Moreover, the results show a key finding: the FPGA is not always the best choice for acceleration, and the SIMD engine should be selected when the wavelet decomposition reduces the frame size below a certain threshold. This dependency on workload size means that an adaptive system that intelligently selects between the SIMD engine and the FPGA achieves the most energy- and performance-efficient operating point.
Submitted 8 February, 2016;
originally announced February 2016.