Search | arXiv e-print repository

Deep Inverse Design for High-Level Synthesis

Authors: ** Chang, Tosiron Adegbija, Yuchao Liao, Claudio Talarico, Ao Li, Janet Roveda

Abstract: High-level synthesis (HLS) has significantly advanced the automation of digital circuits design, yet the need for expertise and time in pragma tuning remains challenging. Existing solutions for the design space exploration (DSE) adopt either heuristic methods, lacking essential information for further optimization potential, or predictive models, missing sufficient generalization due to the time-c… ▽ More High-level synthesis (HLS) has significantly advanced the automation of digital circuits design, yet the need for expertise and time in pragma tuning remains challenging. Existing solutions for the design space exploration (DSE) adopt either heuristic methods, lacking essential information for further optimization potential, or predictive models, missing sufficient generalization due to the time-consuming nature of HLS and the exponential growth of the design space. To address these challenges, we propose Deep Inverse Design for HLS (DID4HLS), a novel approach that integrates graph neural networks and generative models. DID4HLS iteratively optimizes hardware designs aimed at compute-intensive algorithms by learning conditional distributions of design features from post-HLS data. Compared to four state-of-the-art DSE baselines, our method achieved an average improvement of 42.5% on average distance to reference set (ADRS) compared to the best-performing baselines across six benchmarks, while demonstrating high robustness and efficiency. △ Less

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2404.14769 [pdf]

doi 10.1016/j.suscom.2022.100741

A high-level synthesis approach for precisely-timed, energy-efficient embedded systems

Authors: Yuchao Liao, Tosiron Adegbija, Roman Lysecky

Abstract: Embedded systems continue to rapidly proliferate in diverse fields, including medical devices, autonomous vehicles, and more generally, the Internet of Things (IoT). Many embedded systems require application-specific hardware components to meet precise timing requirements within limited resource (area and energy) constraints. High-level synthesis (HLS) is an increasingly popular approach for impro… ▽ More Embedded systems continue to rapidly proliferate in diverse fields, including medical devices, autonomous vehicles, and more generally, the Internet of Things (IoT). Many embedded systems require application-specific hardware components to meet precise timing requirements within limited resource (area and energy) constraints. High-level synthesis (HLS) is an increasingly popular approach for improving the productivity of designing hardware and reducing the time/cost by using high-level languages to specify computational functionality and automatically generate hardware implementations. However, current HLS methods provide limited or no support to incorporate or utilize precise timing specifications within the synthesis and optimization process. In this paper, we present a hybrid high-level synthesis (H-HLS) framework that integrates state-based high-level synthesis (SB-HLS) with performance-driven high-level synthesis (PD-HLS) methods to enable the design and optimization of application-specific embedded systems in which timing information is explicitly and precisely defined in state-based system models. We demonstrate the results achieved by this H-HLS approach using case studies including a wearable pregnancy monitoring device, an ECG-based biometric authentication system, and a synthetic system, and compare the design space exploration results using two PD-HLS tools to show how H-HLS can provide low energy and area under timing constraints. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: Accepted at IGSC 2021, published in Sustainable Computing: Informatics and Systems (SUSCOM) 2022

Journal ref: Sustainable Computing: Informatics and Systems 35 (2022): 100741

arXiv:2404.14754 [pdf, other]

doi 10.1145/3649476.3658738

Skip the Benchmark: Generating System-Level High-Level Synthesis Data using Generative Machine Learning

Authors: Yuchao Liao, Tosiron Adegbija, Roman Lysecky, Ravi Tandon

Abstract: High-Level Synthesis (HLS) Design Space Exploration (DSE) is a widely accepted approach for efficiently exploring Pareto-optimal and optimal hardware solutions during the HLS process. Several HLS benchmarks and datasets are available for the research community to evaluate their methodologies. Unfortunately, these resources are limited and may not be sufficient for complex, multi-component system-l… ▽ More High-Level Synthesis (HLS) Design Space Exploration (DSE) is a widely accepted approach for efficiently exploring Pareto-optimal and optimal hardware solutions during the HLS process. Several HLS benchmarks and datasets are available for the research community to evaluate their methodologies. Unfortunately, these resources are limited and may not be sufficient for complex, multi-component system-level explorations. Generating new data using existing HLS benchmarks can be cumbersome, given the expertise and time required to effectively generate data for different HLS designs and directives. As a result, synthetic data has been used in prior work to evaluate system-level HLS DSE. However, the fidelity of the synthetic data to real data is often unclear, leading to uncertainty about the quality of system-level HLS DSE. This paper proposes a novel approach, called Vaegan, that employs generative machine learning to generate synthetic data that is robust enough to support complex system-level HLS DSE experiments that would be unattainable with only the currently available data. We explore and adapt a Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) for this task and evaluate our approach using state-of-the-art datasets and metrics. We compare our approach to prior works and show that Vaegan effectively generates synthetic HLS data that closely mirrors the ground truth's distribution. △ Less

Submitted 23 April, 2024; originally announced April 2024.

Comments: Accepted at Great Lakes Symposium on VLSI 2024 (GLSVLSI 24)

arXiv:2402.06211 [pdf, other]

Fine-Tuning Surrogate Gradient Learning for Optimal Hardware Performance in Spiking Neural Networks

Authors: Ilkin Aliyev, Tosiron Adegbija

Abstract: The highly sparse activations in Spiking Neural Networks (SNNs) can provide tremendous energy efficiency benefits when carefully exploited in hardware. The behavior of sparsity in SNNs is uniquely shaped by the dataset and training hyperparameters. This work reveals novel insights into the impacts of training on hardware performance. Specifically, we explore the trade-offs between model accuracy a… ▽ More The highly sparse activations in Spiking Neural Networks (SNNs) can provide tremendous energy efficiency benefits when carefully exploited in hardware. The behavior of sparsity in SNNs is uniquely shaped by the dataset and training hyperparameters. This work reveals novel insights into the impacts of training on hardware performance. Specifically, we explore the trade-offs between model accuracy and hardware efficiency. We focus on three key hyperparameters: surrogate gradient functions, beta, and membrane threshold. Results on an FPGA-based hardware platform show that the fast sigmoid surrogate function yields a lower firing rate with similar accuracy compared to the arctangent surrogate on the SVHN dataset. Furthermore, by cross-swee** the beta and membrane threshold hyperparameters, we can achieve a 48% reduction in hardware-based inference latency with only 2.88% trade-off in inference accuracy compared to the default setting. Overall, this study highlights the importance of fine-tuning model hyperparameters as crucial for designing efficient SNN hardware accelerators, evidenced by the fine-tuned model achieving a 1.72x improvement in accelerator efficiency (FPS/W) compared to the most recent work. △ Less

Submitted 9 February, 2024; originally announced February 2024.

arXiv:2402.06210 [pdf, other]

PULSE: Parametric Hardware Units for Low-power Sparsity-Aware Convolution Engine

Authors: Ilkin Aliyev, Tosiron Adegbija

Abstract: Spiking Neural Networks (SNNs) have become popular for their more bio-realistic behavior than Artificial Neural Networks (ANNs). However, effectively leveraging the intrinsic, unstructured sparsity of SNNs in hardware is challenging, especially due to the variability in sparsity across network layers. This variability depends on several factors, including the input dataset, encoding scheme, and ne… ▽ More Spiking Neural Networks (SNNs) have become popular for their more bio-realistic behavior than Artificial Neural Networks (ANNs). However, effectively leveraging the intrinsic, unstructured sparsity of SNNs in hardware is challenging, especially due to the variability in sparsity across network layers. This variability depends on several factors, including the input dataset, encoding scheme, and neuron model. Most existing SNN accelerators fail to account for the layer-specific workloads of an application (model + dataset), leading to high energy consumption. To address this, we propose a design-time parametric hardware generator that takes layer-wise sparsity and the number of processing elements as inputs and synthesizes the corresponding hardware. The proposed design compresses sparse spike trains using a priority encoder and efficiently shifts the activations across the network's layers. We demonstrate the robustness of our proposed approach by first profiling a given application's characteristics followed by performing efficient resource allocation. Results on a Xilinx Kintex FPGA (Field Programmable Gate Arrays) using MNIST, FashionMNIST, and SVHN datasets show a 3.14x improvement in accelerator efficiency (FPS/W) compared to a sparsity-oblivious systolic array-based accelerator. Compared to the most recent sparsity-aware work, our solution improves efficiency by 1.72x. △ Less

Submitted 9 February, 2024; originally announced February 2024.

Comments: IEEE International Symposium on Circuits and Systems (ISCAS) 2024

arXiv:2310.16745 [pdf, other]

Design Space Exploration of Sparsity-Aware Application-Specific Spiking Neural Network Accelerators

Authors: Ilkin Aliyev. Kama Svoboda, Tosiron Adegbija

Abstract: Spiking Neural Networks (SNNs) offer a promising alternative to Artificial Neural Networks (ANNs) for deep learning applications, particularly in resource-constrained systems. This is largely due to their inherent sparsity, influenced by factors such as the input dataset, the length of the spike train, and the network topology. While a few prior works have demonstrated the advantages of incorporat… ▽ More Spiking Neural Networks (SNNs) offer a promising alternative to Artificial Neural Networks (ANNs) for deep learning applications, particularly in resource-constrained systems. This is largely due to their inherent sparsity, influenced by factors such as the input dataset, the length of the spike train, and the network topology. While a few prior works have demonstrated the advantages of incorporating sparsity into the hardware design, especially in terms of reducing energy consumption, the impact on hardware resources has not yet been explored. This is where design space exploration (DSE) becomes crucial, as it allows for the optimization of hardware performance by tailoring both the hardware and model parameters to suit specific application needs. However, DSE can be extremely challenging given the potentially large design space and the interplay of hardware architecture design choices and application-specific model parameters. In this paper, we propose a flexible hardware design that leverages the sparsity of SNNs to identify highly efficient, application-specific accelerator designs. We develop a high-level, cycle-accurate simulation framework for this hardware and demonstrate the framework's benefits in enabling detailed and fine-grained exploration of SNN design choices, such as the layer-wise logical-to-hardware ratio (LHR). Our experimental results show that our design can (i) achieve up to $76\%$ reduction in hardware resources and (ii) deliver a speed increase of up to $31.25\times$, while requiring $27\%$ fewer hardware resources compared to sparsity-oblivious designs. We further showcase the robustness of our framework by varying spike train lengths with different neuron population sizes to find the optimal trade-off points between accuracy and hardware latency. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Journal ref: IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS) 2023

arXiv:2302.08632 [pdf, other]

jazznet: A Dataset of Fundamental Piano Patterns for Music Audio Machine Learning Research

Authors: Tosiron Adegbija

Abstract: This paper introduces the jazznet Dataset, a dataset of fundamental jazz piano music patterns for develo** machine learning (ML) algorithms in music information retrieval (MIR). The dataset contains 162520 labeled piano patterns, including chords, arpeggios, scales, and chord progressions with their inversions, resulting in more than 26k hours of audio and a total size of 95GB. The paper explain… ▽ More This paper introduces the jazznet Dataset, a dataset of fundamental jazz piano music patterns for develo** machine learning (ML) algorithms in music information retrieval (MIR). The dataset contains 162520 labeled piano patterns, including chords, arpeggios, scales, and chord progressions with their inversions, resulting in more than 26k hours of audio and a total size of 95GB. The paper explains the dataset's composition, creation, and generation, and presents an open-source Pattern Generator using a method called Distance-Based Pattern Structures (DBPS), which allows researchers to easily generate new piano patterns simply by defining the distances between pitches within the musical patterns. We demonstrate that the dataset can help researchers benchmark new models for challenging MIR tasks, using a convolutional recurrent neural network (CRNN) and a deep convolutional neural network. The dataset and code are available via: https://github.com/tosiron/jazznet. △ Less

Submitted 16 February, 2023; originally announced February 2023.

Comments: To Appear at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023

arXiv:2009.11442 [pdf, other]

A Study of Runtime Adaptive Prefetching for STTRAM L1 Caches

Authors: Kyle Kuan, Tosiron Adegbija

Abstract: Spin-Transfer Torque RAM (STTRAM) is a promising alternative to SRAM in on-chip caches due to several advantages. These advantages include non-volatility, low leakage, high integration density, and CMOS compatibility. Prior studies have shown that relaxing and adapting the STTRAM retention time to runtime application needs can substantially reduce overall cache energy without significant latency o… ▽ More Spin-Transfer Torque RAM (STTRAM) is a promising alternative to SRAM in on-chip caches due to several advantages. These advantages include non-volatility, low leakage, high integration density, and CMOS compatibility. Prior studies have shown that relaxing and adapting the STTRAM retention time to runtime application needs can substantially reduce overall cache energy without significant latency overheads, due to the lower STTRAM write energy and latency in shorter retention times. In this paper, as a first step towards efficient prefetching across the STTRAM cache hierarchy, we study prefetching in reduced retention STTRAM L1 caches. Using SPEC CPU 2017 benchmarks, we analyze the energy and latency impact of different prefetch distances in different STTRAM cache retention times for different applications. We show that expired_unused_prefetches---the number of unused prefetches expired by the reduced retention time STTRAM cache---can accurately determine the best retention time for energy consumption and access latency. This new metric can also provide insights into the best prefetch distance for memory bandwidth consumption and prefetch accuracy. Based on our analysis and insights, we propose Prefetch-Aware Retention time Tuning (PART) and Retention time-based Prefetch Control (RPC). Compared to a base STTRAM cache, PART and RPC collectively reduced the average cache energy and latency by 22.24% and 24.59%, respectively. When the base architecture was augmented with the state-of-the-art near-side prefetch throttling (NST), PART+RPC reduced the average cache energy and latency by 3.50% and 3.59%, respectively, and reduced the hardware overhead by 54.55% △ Less

Submitted 23 September, 2020; originally announced September 2020.

Comments: To appear in IEEE International Conference on Computer Design (ICCD), October 2020

arXiv:1908.04744 [pdf, other]

Energy and Performance Analysis of STTRAM Caches for Mobile Applications

Authors: Kyle Kuan, Tosiron Adegbija

Abstract: Spin-Transfer Torque RAMs (STTRAMs) have been shown to offer much promise for implementing emerging cache architectures. This paper studies the viability of STTRAM caches for mobile workloads from the perspective of energy and latency. Specifically, we explore the benefits of reduced retention STTRAM caches for mobile applications. We analyze the characteristics of mobile applications' cache block… ▽ More Spin-Transfer Torque RAMs (STTRAMs) have been shown to offer much promise for implementing emerging cache architectures. This paper studies the viability of STTRAM caches for mobile workloads from the perspective of energy and latency. Specifically, we explore the benefits of reduced retention STTRAM caches for mobile applications. We analyze the characteristics of mobile applications' cache blocks and how those characteristics dictate the appropriate retention time for mobile device caches. We show that due to their inherently interactive nature, mobile applications' execution characteristics---and hence, STTRAM cache design requirements---differ from other kinds of applications. We also explore various STTRAM cache designs in both single and multicore systems, and at different cache levels, that can efficiently satisfy mobile applications' execution requirements, in order to maximize energy savings without introducing substantial latency overhead. △ Less

Submitted 8 August, 2019; originally announced August 2019.

Comments: To appear in IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) 2019

arXiv:1908.02238 [pdf, other]

doi 10.1109/TPDS.2019.2929781

A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior

Authors: Keeley Criswell, Tosiron Adegbija

Abstract: Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution interva… ▽ More Adaptable computing is an increasingly important paradigm that specializes system resources to variable application requirements, environmental conditions, or user requirements. Adapting computing resources to variable application requirements (or application phases) is otherwise known as phase-based optimization. Phase-based optimization takes advantage of application phases, or execution intervals of an application, that behave similarly, to enable effective and beneficial adaptability. In order for phase-based optimization to be effective, the phases must first be classified to determine when application phases begin and end, and ensure that system resources are accurately specialized. In this paper, we present a survey of phase classification techniques that have been proposed to exploit the advantages of adaptable computing through phase-based optimization. We focus on recent techniques and classify these techniques with respect to several factors in order to highlight their similarities and differences. We divide the techniques by their major defining characteristics---online/offline and serial/parallel. In addition, we discuss other characteristics such as prediction and detection techniques, the characteristics used for prediction, interval type, etc. We also identify gaps in the state-of-the-art and discuss future research directions to enable and fully exploit the benefits of adaptable computing. △ Less

Submitted 16 July, 2019; originally announced August 2019.

Comments: To appear in IEEE Transactions on Parallel and Distributed Systems (TPDS)

arXiv:1905.07511 [pdf, other]

doi 10.1109/TC.2019.2918153

HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems

Authors: Kyle Kuan, Tosiron Adegbija

Abstract: Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable for the last-level cache (LLC), which typically consumes a large area of silicon die. However, long write latency and high write energy still… ▽ More Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable for the last-level cache (LLC), which typically consumes a large area of silicon die. However, long write latency and high write energy still remain challenges of implementing STT-RAMs in the CPU cache. An increasingly popular method for addressing this challenge involves trading off the non-volatility for reduced write speed and write energy by relaxing the STT-RAM's data retention time. However, in order to maximize energy saving potential, the cache configurations, including STT-RAM's retention time, must be dynamically adapted to executing applications' variable memory needs. In this paper, we propose a highly adaptable last level STT-RAM cache (HALLS) that allows the LLC configurations and retention time to be adapted to applications' runtime execution requirements. We also propose low-overhead runtime tuning algorithms to dynamically determine the best (lowest energy) cache configurations and retention times for executing applications. Compared to prior work, HALLS reduced the average energy consumption by 60.57% in a quad-core system, while introducing marginal latency overhead. △ Less

Submitted 17 May, 2019; originally announced May 2019.

Comments: To Appear on IEEE Transactions on Computers (TC)

arXiv:1904.09363 [pdf, other]

doi 10.1109/TCAD.2019.2912920

Energy-Efficient Runtime Adaptable L1 STT-RAM Cache Design

Authors: Kyle Kuan, Tosiron Adegbija

Abstract: Much research has shown that applications have variable runtime cache requirements. In the context of the increasingly popular Spin-Transfer Torque RAM (STT-RAM) cache, the retention time, which defines how long the cache can retain a cache block in the absence of power, is one of the most important cache requirements that may vary for different applications. In this paper, we propose a Logically… ▽ More Much research has shown that applications have variable runtime cache requirements. In the context of the increasingly popular Spin-Transfer Torque RAM (STT-RAM) cache, the retention time, which defines how long the cache can retain a cache block in the absence of power, is one of the most important cache requirements that may vary for different applications. In this paper, we propose a Logically Adaptable Retention Time STT-RAM (LARS) cache that allows the retention time to be dynamically adapted to applications' runtime requirements. LARS cache comprises of multiple STT-RAM units with different retention times, with only one unit being used at a given time. LARS dynamically determines which STT-RAM unit to use during runtime, based on executing applications' needs. As an integral part of LARS, we also explore different algorithms to dynamically determine the best retention time based on different cache design tradeoffs. Our experiments show that by adapting the retention time to different applications' requirements, LARS cache can reduce the average cache energy by 25.31%, compared to prior work, with minimal overheads. △ Less

Submitted 19 April, 2019; originally announced April 2019.

arXiv:1603.02393 [pdf, other]

doi 10.1109/TCAD.2017.2717782

Microprocessor Optimizations for the Internet of Things: A Survey

Authors: Tosiron Adegbija, Anita Rogacs, Chandrakant Patel, Ann Gordon-Ross

Abstract: The Internet of Things (IoT) refers to a pervasive presence of interconnected and uniquely identifiable physical devices. These devices' goal is to gather data and drive actions in order to improve productivity, and ultimately reduce or eliminate reliance on human intervention for data acquisition, interpretation, and use. The proliferation of these connected low-power devices will result in a dat… ▽ More The Internet of Things (IoT) refers to a pervasive presence of interconnected and uniquely identifiable physical devices. These devices' goal is to gather data and drive actions in order to improve productivity, and ultimately reduce or eliminate reliance on human intervention for data acquisition, interpretation, and use. The proliferation of these connected low-power devices will result in a data explosion that will significantly increase data transmission costs with respect to energy consumption and latency. Edge computing reduces these costs by performing computations at the edge nodes, prior to data transmission, to interpret and/or utilize the data. While much research has focused on the IoT's connected nature and communication challenges, the challenges of IoT embedded computing with respect to device microprocessors has received much less attention. This paper explores IoT applications' execution characteristics from a microarchitectural perspective and the microarchitectural characteristics that will enable efficient and effective edge computing. To tractably represent a wide variety of next-generation IoT applications, we present a broad IoT application classification methodology based on application functions, to enable quicker workload characterizations for IoT microprocessors. We then survey and discuss potential microarchitectural optimizations and computing paradigms that will enable the design of right-provisioned microprocessors that are efficient, configurable, extensible, and scalable. This paper provides a foundation for the analysis and design of a diverse set of microprocessor architectures for next-generation IoT devices. △ Less

Submitted 20 February, 2018; v1 submitted 8 March, 2016; originally announced March 2016.

Comments: Published at IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD); Special Issue on Circuit and System Design for Internet of Things

arXiv:1602.04415 [pdf]

Phase distance map**: a phase-based cache tuning methodology for embedded systems

Authors: Tosiron Adegbija, Ann Gordon-Ross, Arslan Munir

Abstract: Networked embedded systems typically leverage a collection of low-power embedded systems (nodes) to collaboratively execute applications spanning diverse application domains (e.g., video, image processing, communication, etc.) with diverse application requirements. The individual networked nodes must operate under stringent constraints (e.g., energy, memory, etc.) and should be specialized to meet… ▽ More Networked embedded systems typically leverage a collection of low-power embedded systems (nodes) to collaboratively execute applications spanning diverse application domains (e.g., video, image processing, communication, etc.) with diverse application requirements. The individual networked nodes must operate under stringent constraints (e.g., energy, memory, etc.) and should be specialized to meet varying application requirements in order to adhere to these constraints. Phase-based tuning specializes system tunable parameters to the varying runtime requirements of different execution phases to meet optimization goals. Since the design space for tunable systems can be very large, one of the major challenges in phase-based tuning is determining the best configuration for each phase without incurring significant tuning overhead (e.g., energy and/or performance) during design space exploration. In this paper, we propose phase distance map**, which directly determines the best configuration for a phase, thereby eliminating design space exploration. Phase distance map** applies the correlation between the characteristics and best configuration of a known phase to determine the best configuration of a new phase. Experimental results verify that our phase distance map** approach, when applied to cache tuning, determines cache configurations within 1 % of the optimal configurations on average and yields an energy delay product savings of 27 % on average. △ Less

Submitted 13 February, 2016; originally announced February 2016.

Comments: 26 pages, Springer Design Automation for Embedded Systems, Special Issue on Networked Embedded Systems, 2014

arXiv:1602.04414 [pdf]

Temperature-aware Dynamic Optimization of Embedded Systems

Authors: Tosiron Adegbija, Ann Gordon-Ross

Abstract: Due to embedded systems` stringent design constraints, much prior work focused on optimizing energy consumption and/or performance. Since embedded systems typically have fewer cooling options, rising temperature, and thus temperature optimization, is an emergent concern. Most embedded systems only dissipate heat by passive convection, due to the absence of dedicated thermal management hardware mec… ▽ More Due to embedded systems` stringent design constraints, much prior work focused on optimizing energy consumption and/or performance. Since embedded systems typically have fewer cooling options, rising temperature, and thus temperature optimization, is an emergent concern. Most embedded systems only dissipate heat by passive convection, due to the absence of dedicated thermal management hardware mechanisms. The embedded system`s temperature not only affects the system`s reliability, but could also affect the performance, power, and cost. Thus, embedded systems require efficient thermal management techniques. However, thermal management can conflict with other optimization objectives, such as execution time and energy consumption. In this paper, we focus on managing the temperature using a synergy of cache optimization and dynamic frequency scaling, while also optimizing the execution time and energy consumption. This paper provides new insights on the impact of cache parameters on efficient temperature-aware cache tuning heuristics. In addition, we present temperature-aware phase-based tuning, TaPT, which determines Pareto optimal clock frequency and cache configurations for fine-grained execution time, energy, and temperature tradeoffs. TaPT enables autonomous system optimization and also allows designers to specify temperature constraints and optimization priorities. Experiments show that TaPT can effectively reduce execution time, energy, and temperature, while imposing minimal hardware overhead. △ Less

Submitted 13 February, 2016; originally announced February 2016.

Comments: 24 pages, 12 figures, Extended version of paper "Thermal-aware Phase-based Tuning of Embedded Systems" published in GLSVLSI 2014, Submitted to ACM TODAES

Showing 1–15 of 15 results for author: Adegbija, T