-
Exploiting Neural-Network Statistics for Low-Power DNN Inference
Authors:
Lennart Bamberg,
Ardalan Najafi,
Alberto Garcia-Ortiz
Abstract:
Specialized compute blocks have been developed for efficient DNN execution. However, due to the vast amount of data and parameter movements, the interconnects and on-chip memories form another bottleneck, impairing power and performance. This work addresses this bottleneck by contributing a low-power technique for edge-AI inference engines that combines overhead-free coding with a statistical analysis of the data and parameters of neural networks. Our approach reduces the interconnect and memory power consumption by up to 80% for state-of-the-art benchmarks while providing additional power savings for the compute blocks by up to 39%. These power improvements are achieved with no loss of accuracy and negligible hardware cost.
Submitted 9 November, 2023;
originally announced November 2023.
-
MemPool-3D: Boosting Performance and Efficiency of Shared-L1 Memory Many-Core Clusters with 3D Integration
Authors:
Matheus Cavalcante,
Anthony Agnesina,
Samuel Riedel,
Moritz Brunion,
Alberto Garcia-Ortiz,
Dragomir Milojevic,
Francky Catthoor,
Sung Kyu Lim,
Luca Benini
Abstract:
Three-dimensional integrated circuits promise power, performance, and footprint gains compared to their 2D counterparts, thanks to drastic reductions in the interconnects' length through their smaller form factor. We can leverage the potential of 3D integration by enhancing MemPool, an open-source many-core design with 256 cores and a shared pool of L1 scratchpad memory connected with a low-latency interconnect. MemPool's baseline 2D design is severely limited by routing congestion and wire propagation delay, making the design ideal for 3D integration. In architectural terms, we increase MemPool's scratchpad memory capacity beyond the sweet spot for 2D designs, improving performance in a common digital signal processing kernel. We propose a 3D MemPool design that leverages a smart partitioning of the memory resources across two layers to balance the size and utilization of the stacked dies. In this paper, we explore the architectural and the technology parameter spaces by analyzing the power, performance, area, and energy efficiency of MemPool instances in 2D and 3D with 1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28 nm technology node. We observe a performance gain of 9.1% when running a matrix multiplication on the MemPool-3D design with 4 MiB of scratchpad memory compared to the MemPool 2D counterpart. In terms of energy efficiency, we can implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget 15% smaller than its 2D counterpart, and even 3.7% smaller than the MemPool-2D instance with one-fourth of the L1 scratchpad memory capacity.
Submitted 2 December, 2021;
originally announced December 2021.
-
Ratatoskr: An open-source framework for in-depth power, performance and area analysis in 3D NoCs
Authors:
Jan Moritz Joseph,
Lennart Bamberg,
Imad Hajjar,
Anna Drewes,
Behnam Razi Perjikolaei,
Alberto García-Ortiz,
Thilo Pionteck
Abstract:
We introduce ratatoskr, an open-source framework for in-depth power, performance and area (PPA) analysis in NoCs for 3D-integrated and heterogeneous System-on-Chips (SoCs). It covers all layers of abstraction by providing a NoC hardware implementation on RT level, a NoC simulator on cycle-accurate level and an application model on transaction level. By this comprehensive approach, ratatoskr can provide the following specific PPA analyses: Dynamic power of links can be measured within 2.4% accuracy of bit-level simulations while maintaining cycle-accurate simulation speed. Router power is determined from RT level synthesis combined with cycle-accurate simulations. The performance of the whole NoC can be measured both via cycle-accurate and RT level simulations. The performance of individual routers is obtained from RT level including gate-level verification. The NoC area is calculated from RT level. Despite these manifold features, ratatoskr offers easy two-step user interaction: first, a single point of entry allows design parameters to be set; second, PPA reports are generated automatically. For both the input and the output, different levels of abstraction can be chosen for high-level rapid network analysis or low-level improvement of architectural details. The synthesized NoC model reduces total router power by up to 32% and router area by 3% in comparison to a conventional standard router. As a forward-thinking and unique feature not found in other NoC PPA-measurement tools, ratatoskr supports heterogeneous 3D integration, which is one of the most promising integration paradigms for upcoming SoCs. Thereby, ratatoskr lays the groundwork for designing their communication architectures.
Submitted 14 January, 2020; v1 submitted 11 December, 2019;
originally announced December 2019.
-
System-level optimization of Network-on-Chips for heterogeneous 3D System-on-Chips
Authors:
Jan Moritz Joseph,
Dominik Ermel,
Lennart Bamberg,
Alberto García-Ortiz,
Thilo Pionteck
Abstract:
For a system-level design of Networks-on-Chip for 3D heterogeneous System-on-Chip (SoC), the locations of components, routers and vertical links are determined from an application model and technology parameters. In conventional methods, the two inputs are accounted for separately; here, we define an integrated problem that considers both application model and technology parameters. We show that this problem does not allow for an exact solution in reasonable time, as is common for many design problems. Therefore, we contribute a heuristic by proposing design steps, which are based on separation of intralayer and interlayer communication. The advantage is that this new problem can be solved with well-known methods. We use 3D Vision SoC case studies to quantify the advantages and the practical usability of the proposed optimization approach. We achieve up to 18.8% reduced white space and up to 12.4% better network performance in comparison to conventional approaches.
Submitted 3 October, 2019; v1 submitted 30 September, 2019;
originally announced September 2019.
-
Transaction Level Analysis for a Clustered and Hardware-Enhanced Task Manager on Homogeneous Many-Core Systems
Authors:
Daniel Gregorek,
Robert Schmidt,
Alberto Garcia-Ortiz
Abstract:
The increasing parallelism of many-core systems demands efficient strategies for run-time system management. Due to the large number of cores, the management overhead has a rising impact on the overall system performance. This work analyzes a clustered infrastructure of dedicated hardware nodes to manage a homogeneous many-core system. The hardware nodes implement a message passing protocol and perform the task mapping and synchronization at run-time. To make meaningful mapping decisions, the global management nodes employ a workload status communication mechanism.
This paper discusses the design-space of the dedicated infrastructure by means of task mapping use-cases and a parallel benchmark including application-interference. We evaluate the architecture in terms of application speedup and analyze the mechanism for the status communication. A comparison versus centralized and fully-distributed configurations demonstrates the reduction of the computation and communication management overhead for our approach.
Submitted 10 February, 2015;
originally announced February 2015.
-
Proceedings of the Workshop on High Performance Energy Efficient Embedded Systems (HIP3ES) 2015
Authors:
Francisco Corbera,
Andrés Rodríguez,
Rafael Asenjo,
Angeles Navarro,
Antonio Vilches,
Maria Garzaran,
Ismat Chaib Draa,
Jamel Tayeb,
Smail Niar,
Mikael Desertot,
Daniel Gregorek,
Robert Schmidt,
Alberto Garcia-Ortiz,
Pedro Lopez-Garcia,
Rémy Haemmerlé,
Maximiliano Klemen,
Umer Liqat,
Manuel V. Hermenegildo,
Radim Vavřík,
Albert Saà-Garriga,
David Castells-Rufas,
Jordi Carrabina
Abstract:
Proceedings of the Workshop on High Performance Energy Efficient Embedded Systems (HIP3ES) 2015. Amsterdam, January 21st. Collocated with HIPEAC 2015 Conference.
Submitted 13 January, 2015;
originally announced January 2015.