A Fully Automated Platform for Evaluating ReRAM Crossbars

Authors: Rebecca Pelke, Felix Staudigl, Niklas Thomas, Nils Bosbach, Mohammed Hossein, Jose Cubero-Cascante, Leticia Bolzani Poehls, Rainer Leupers, Jan Moritz Joseph

Abstract: Resistive Random Access Memory (ReRAM) is a promising candidate for implementing Computing-in-Memory (CIM) architectures and neuromorphic circuits. ReRAM cells exhibit significant variability across different memristive devices and cycles, necessitating further improvements in the areas of devices, algorithms, and applications. To achieve this, understanding the stochastic behavior of the different ReRAM technologies is essential. The NeuroBreakoutBoard (NBB) is a versatile instrumentation platform to characterize Non-Volatile Memories (NVMs). However, the NBB itself does not provide any functionality in the form of software or a controller. In this paper, we present a control board for the NBB that is able to perform reliability assessments of 1T1R ReRAM crossbars. In more detail, we implement an interface that allows a host PC to communicate with the NBB via the new control board. In a case study, we analyze the Cycle-to-Cycle (C2C) variation and read disturb of TiN/Ti/HfO2/TiN cells for different read voltages to gain an understanding of their operational behavior.

Submitted 20 March, 2024; originally announced March 2024.
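One common way to summarize C2C variation is the coefficient of variation (CV) of a cell's resistance over repeated switching cycles. The Python sketch below assumes that metric. The function name `c2c_variation` and the resistance values are hypothetical and do not come from the paper, whose exact analysis may differ.

```python
from statistics import mean, pstdev


def c2c_variation(resistances_ohm):
    """Return the coefficient of variation (population std / mean) of one
    cell's resistance across successive cycles (hypothetical C2C metric)."""
    if len(resistances_ohm) < 2:
        raise ValueError("need readings from at least two cycles")
    return pstdev(resistances_ohm) / mean(resistances_ohm)


# Hypothetical high-resistance-state readings of one cell over four cycles.
hrs_readings = [100e3, 120e3, 80e3, 100e3]
print(f"C2C CV: {c2c_variation(hrs_readings):.3f}")  # prints "C2C CV: 0.141"
```

A larger CV indicates stronger cycle-to-cycle spread. Device-to-device variation would instead compare such statistics across different cells.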

arXiv:2401.17819

QTFlow: Quantitative Timing-Sensitive Information Flow for Security-Aware Hardware Design on RTL

Authors: Lennart M. Reimann, Anshul Prashar, Chiara Ghinami, Rebecca Pelke, Dominik Sisejkovic, Farhad Merchant, Rainer Leupers

Abstract: In contemporary Electronic Design Automation (EDA) tools, security often takes a backseat to the primary goals of power, performance, and area optimization. Commonly, the security analysis is conducted by hand, leading to vulnerabilities in the design remaining unnoticed. Security-aware EDA tools assist the designer in the identification and removal of security threats while keeping performance and area in mind. Cutting-edge methods employ information flow analysis to identify inadvertent information leaks in design structures. Current information leakage detection methods use quantitative information flow analysis to quantify the leaks. However, handling sequential circuits poses challenges for state-of-the-art techniques due to their time-agnostic nature, which overlooks timing channels and introduces false positives. To address this, we introduce QTFlow, a timing-sensitive framework for quantifying hardware information leakages during the design phase. Illustrating its effectiveness on open-source benchmarks, QTFlow autonomously identifies timing channels and diminishes all false positives arising from time-agnostic analysis when contrasted with current state-of-the-art techniques.

Submitted 6 February, 2024; v1 submitted 31 January, 2024; originally announced January 2024.

Comments: accepted at IEEE VLSI-DAT 2024, Taiwan; 4 pages

arXiv:2401.07671

CLSA-CIM: A Cross-Layer Scheduling Approach for Computing-in-Memory Architectures

Authors: Rebecca Pelke, Jose Cubero-Cascante, Nils Bosbach, Felix Staudigl, Rainer Leupers, Jan Moritz Joseph

Abstract: The demand for efficient machine learning (ML) accelerators is growing rapidly, driving the development of novel computing concepts such as resistive random access memory (RRAM)-based tiled computing-in-memory (CIM) architectures. CIM enables computation within the memory unit, resulting in faster data processing and reduced power consumption. Efficient compiler algorithms are essential to exploit the potential of tiled CIM architectures. While conventional ML compilers focus on code generation for CPUs, GPUs, and other von Neumann architectures, adaptations are needed to cover CIM architectures. Cross-layer scheduling is a promising approach, as it enhances the utilization of CIM cores, thereby accelerating computations. Although similar concepts are implicitly used in previous work, there is a lack of clear and quantifiable algorithmic definitions for cross-layer scheduling for tiled CIM architectures. To close this gap, we present CLSA-CIM, a cross-layer scheduling algorithm for tiled CIM architectures. We integrate CLSA-CIM with existing weight-mapping strategies and compare performance against state-of-the-art (SOTA) scheduling algorithms. CLSA-CIM improves the utilization by up to 17.9x, resulting in an overall speedup of up to 29.2x compared to SOTA.

Submitted 17 January, 2024; v1 submitted 15 January, 2024; originally announced January 2024.
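Computing "within the memory unit" works as follows in an idealized resistive crossbar. Each column current is the sum of the row voltages weighted by the cell conductances (Ohm's law per cell, Kirchhoff's current law per column), so one read performs a matrix-vector multiplication in place. The Python sketch below models only this ideal behavior. It ignores wire resistance, device variability and ADC/DAC quantization, and the function name `crossbar_mvm` is hypothetical.

```python
def crossbar_mvm(conductances_s, voltages_v):
    """Ideal crossbar read: I_j = sum_i G[i][j] * V[i].

    conductances_s: rows x cols matrix of cell conductances in siemens.
    voltages_v: one read voltage per row (word line), in volts.
    Returns one current per column (bit line), in amperes.
    """
    rows = len(conductances_s)
    if rows == 0 or rows != len(voltages_v):
        raise ValueError("need a non-empty matrix and one voltage per row")
    cols = len(conductances_s[0])
    return [
        sum(conductances_s[i][j] * voltages_v[i] for i in range(rows))
        for j in range(cols)
    ]


# 2x2 example: conductances of 1 mS / 0.5 mS, read voltages of 0.2 V / 0.1 V.
G = [[1e-3, 0.5e-3],
     [0.5e-3, 1e-3]]
V = [0.2, 0.1]
print(crossbar_mvm(G, V))  # column currents in A, about 2.5e-4 and 2.0e-4
```

In a tiled CIM architecture, a layer's weight matrix is split across many such crossbars (cores). How these cores are kept busy is the utilization problem that scheduling targets.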

arXiv:2309.03805

Mapping of CNNs on multi-core RRAM-based CIM architectures

Authors: Rebecca Pelke, Nils Bosbach, Jose Cubero, Felix Staudigl, Rainer Leupers, Jan Moritz Joseph

Abstract: RRAM-based multi-core systems improve the energy efficiency and performance of CNNs. However, the distributed parallel execution of convolutional layers causes critical data dependencies that limit the potential speedup. This paper presents synchronization techniques for parallel inference of convolutional layers on RRAM-based CIM architectures. We propose an architecture optimization that enables efficient data exchange and discuss the impact of different architecture setups on the performance. The corresponding compiler algorithms are optimized for high speedup and low memory consumption during CNN inference. We achieve more than 99% of the theoretical acceleration limit with a marginal data transmission overhead of less than 4% for state-of-the-art CNN benchmarks.

Submitted 26 October, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

arXiv:2308.09445

doi 10.1007/978-3-031-46077-7_12

parti-gem5: gem5's Timing Mode Parallelised

Authors: José Cubero-Cascante, Niko Zurstraßen, Jörn Nöller, Rainer Leupers, Jan Moritz Joseph

Abstract: Detailed timing models are indispensable tools for the design space exploration of Multiprocessor Systems on Chip (MPSoCs). As core counts continue to increase, the complexity in memory hierarchies and interconnect topologies is also growing, making accurate predictions of design decisions more challenging than ever. In this context, the open-source Full System Simulator (FSS) gem5 is a popular choice for MPSoC design space exploration, thanks to its flexibility and robust set of detailed timing models. However, its single-threaded simulation kernel severely hampers its throughput. To address this challenge, we introduce parti-gem5, an extension of gem5 that enables parallel timing simulations on modern multi-core simulation hosts. Unlike previous works, parti-gem5 supports gem5's timing mode, the O3CPU, and Ruby's custom cache and interconnect models. Compared to reference single-thread simulations, we achieved speedups of up to 42.7x when simulating a 120-core ARM MPSoC on a 64-core x86-64 host system. While our method introduces timing deviations, the error in total simulated time is below 15% in most cases.

Submitted 13 May, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

Comments: Pre-print of work presented at SAMOS Conference XXIII

ACM Class: I.6.0

arXiv:2308.02694

SoftFlow: Automated HW-SW Confidentiality Verification for Embedded Processors

Authors: Lennart M. Reimann, Jonathan Wiesner, Dominik Sisejkovic, Farhad Merchant, Rainer Leupers

Abstract: Despite its ever-increasing impact, security is not considered a design objective in commercial electronic design automation (EDA) tools. This results in vulnerabilities being overlooked during the software-hardware design process. Specifically, vulnerabilities that allow leakage of sensitive data might stay unnoticed by standard testing, as the leakage itself might not result in evident functional changes. Therefore, EDA tools are needed to assess the confidentiality of sensitive data during the design process. However, state-of-the-art implementations either solely consider the hardware or restrict the expressiveness of the security properties that must be proven. Consequently, more proficient tools are required to assist in the software and hardware design. To address this issue, we propose SoftFlow, an EDA tool that determines whether a given piece of software exploits existing leakage paths in hardware. Based on our analysis, the leakage paths can be retained if proven not to be exploited by software. This is desirable if the removal significantly impacts the design's performance or functionality, or if the path cannot be removed because the chip is already manufactured. We demonstrate the feasibility of SoftFlow by identifying vulnerabilities in OpenSSL cryptographic C programs and redesigning them to avoid leakage of cryptographic keys in a RISC-V architecture.

Submitted 4 August, 2023; originally announced August 2023.

Comments: 6 pages, accepted at 31st IFIP/IEEE Conference on Very Large Scale Integration (VLSI-SoC 2023)

arXiv:2308.02400

Work-in-Progress: A Universal Instrumentation Platform for Non-Volatile Memories

Authors: Felix Staudigl, Mohammed Hossein, Tobias Ziegler, Hazem Al Indari, Rebecca Pelke, Sebastian Siegel, Dirk J. Wouters, Dominik Sisejkovic, Jan Moritz Joseph, Rainer Leupers

Abstract: Emerging non-volatile memories (NVMs) represent a disruptive technology that allows a paradigm shift from the conventional von Neumann architecture towards more efficient computing-in-memory (CIM) architectures. Several instrumentation platforms have been proposed to interface with NVMs, allowing the characterization of single cells and crossbar structures. However, these platforms suffer from low flexibility and are not capable of performing CIM operations on NVMs. Therefore, we recently designed and built the NeuroBreakoutBoard, a highly versatile instrumentation platform capable of executing CIM on NVMs. We present our preliminary results demonstrating a relative error < 5% in the range of 1 kΩ to 1 MΩ and showcase the switching behavior of a HfO2/Ti-based memristive cell.

Submitted 3 August, 2023; originally announced August 2023.

arXiv:2304.05686

Gate Camouflaging Using Reconfigurable ISFET-Based Threshold Voltage Defined Logic

Authors: Elmira Moussavi, Animesh Singh, Dominik Sisejkovic, Aravind Padma Kumar, Daniyar Kizatov, Sven Ingebrandt, Rainer Leupers, Vivek Pachauri, Farhad Merchant

Abstract: Most chip designers outsource the manufacturing of their integrated circuits (ICs) to external foundries due to the exorbitant cost and complexity of the process. This involvement of untrustworthy, external entities opens the door to major security threats, such as reverse engineering (RE). RE can reveal the physical structure and functionality of intellectual property (IP) and ICs, leading to IP theft, counterfeiting, and other misuses. The concept of the threshold voltage-defined (TVD) logic family is a potential mechanism to obfuscate and protect the design and prevent RE. However, it addresses post-fabrication RE issues, and it has been shown that dopant profiling techniques can be used to determine the threshold voltage of the transistor and break the obfuscation. In this work, we propose a novel TVD modulation with ion-sensitive field-effect transistors (ISFETs) to protect the IC from RE and IP piracy. Compared to the conventional TVD logic family, ISFET-TVD allows post-manufacture programming. The ISFET-TVD logic gate can be reconfigured after fabrication, maintaining an exact schematic architecture with an identical layout for all types of logic gates, and thus overcoming the shortcomings of the classic TVD. The threshold voltage of the ISFETs can be adjusted after fabrication by changing the ion concentration of the material in contact with the ion-sensitive gate of the transistor, depending on the Boolean functionality. The ISFET is CMOS-compatible and is therefore implemented in 45 nm CMOS technology for demonstration.

Submitted 12 April, 2023; originally announced April 2023.

arXiv:2304.05682

Automated Information Flow Analysis for Integrated Computing-in-Memory Modules

Authors: Lennart M. Reimann, Felix Staudigl, Rainer Leupers

Abstract: Novel non-volatile memory (NVM) technologies offer high-speed and high-density data storage. In addition, they overcome the von Neumann bottleneck by enabling computing-in-memory (CIM). Various computer architectures have been proposed to integrate CIM blocks in their design, forming a mixed-signal system to combine the computational benefits of CIM with the robustness of conventional CMOS. Novel electronic design automation (EDA) tools are necessary to design and manufacture these so-called neuromorphic systems. Furthermore, EDA tools must consider the impact of security vulnerabilities, as hardware security attacks have increased in recent years. Existing information flow analysis (IFA) frameworks offer an automated tool-suite to uphold the confidentiality property for sensitive data during the design of hardware. However, currently available mixed-signal EDA tools are not capable of analyzing the information flow of neuromorphic systems. To illustrate the shortcomings, we develop information flow protocols for NVMs that can be easily integrated into the existing tool-suites. We show the limitation of the state-of-the-art by analyzing the flow from sensitive signals through multiple memristive crossbar structures to potential untrusted components and outputs. Finally, we provide a thorough discussion of the merits and flaws of the mixed-signal IFA frameworks on neuromorphic systems.

Submitted 12 April, 2023; originally announced April 2023.

Comments: 5 pages, accepted at 21st IEEE Interregional NEWCAS Conference, Edinburgh, Scotland

arXiv:2302.07655

Fault Injection in Native Logic-in-Memory Computation on Neuromorphic Hardware

Authors: Felix Staudigl, Thorben Fetz, Rebecca Pelke, Dominik Sisejkovic, Jan Moritz Joseph, Leticia Bolzani Pöhls, Rainer Leupers

Abstract: Logic-in-memory (LIM) describes the execution of logic gates within memristive crossbar structures, promising to improve performance and energy efficiency. Utilizing only binary values, LIM particularly excels in accelerating binary neural networks (BNNs), shifting it into the focus of edge applications. Considering its potential, the impact of faults on BNNs accelerated with LIM still lacks investigation. In this paper, we propose faulty logic-in-memory (FLIM), a fault injection platform capable of executing full-fledged BNNs on LIM while injecting in-field faults. The results show that FLIM runs a single MNIST picture 66754x faster than the state of the art by offering a fine-grained fault injection methodology.

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2211.16891

Quantitative Information Flow for Hardware: Advancing the Attack Landscape

Authors: Lennart M. Reimann, Sarp Erdönmez, Dominik Sisejkovic, Rainer Leupers

Abstract: Security still remains an afterthought in modern Electronic Design Automation (EDA) tools, which solely focus on enhancing performance and reducing the chip size. Typically, the security analysis is conducted by hand, leading to vulnerabilities in the design remaining unnoticed. Security-aware EDA tools assist the designer in the identification and removal of security threats while keeping performance and area in mind. State-of-the-art approaches utilize information flow analysis to spot unintended information leakages in design structures. However, the classification of such threats is binary, resulting in negligible leakages being listed as well. A novel quantitative analysis allows the application of a metric to determine a numeric value for a leakage. Nonetheless, current approximations to quantify the leakage are still prone to overlooking leakages. The mathematical model 2D-QModel introduced in this work aims to overcome this shortcoming. Additionally, whereas previous work only includes a limited threat model, multiple threat models can be applied using the provided approach. Open-source benchmarks are used to show the capabilities of 2D-QModel to identify hardware Trojans in the design while ignoring insignificant leakages.

Submitted 30 November, 2022; originally announced November 2022.

Comments: 4 pages, accepted at IEEE Latin American Symposium on Circuits and Systems (LASCAS), 2023
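A "numeric value for a leakage" can be illustrated with a textbook quantitative-information-flow measure: the Shannon mutual information I(S; O) between a secret input S and an observed output O. The Python sketch below applies that measure to a toy Boolean function. It is not 2D-QModel. It assumes a uniformly distributed secret plus other uniform inputs that the observer does not see, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c)


def leakage_bits(f, secret_bits, other_bits):
    """Mutual information I(S; O) in bits for O = f(S, X), where the secret S
    and the unobserved other inputs X are uniform and independent."""
    secrets = list(product((0, 1), repeat=secret_bits))
    others = list(product((0, 1), repeat=other_bits))
    out, joint = Counter(), Counter()
    for s in secrets:
        for x in others:
            o = f(s, x)
            out[o] += 1
            joint[(s, o)] += 1
    # I(S; O) = H(S) + H(O) - H(S, O); H(S) = log2(#secrets) since S is uniform.
    return log2(len(secrets)) + entropy(out) - entropy(joint)


print(leakage_bits(lambda s, x: s[0], 2, 0))         # 1.0: one key bit exposed
print(leakage_bits(lambda s, x: s[0] ^ x[0], 1, 1))  # 0.0: masked by a uniform bit
```

Unlike a binary "leaks / does not leak" verdict, the measure separates a full one-bit leak from partial leaks. For example, `s[0] & x[0]` leaks about 0.31 bits.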

arXiv:2208.04769

A Temperature Independent Readout Circuit for ISFET-Based Sensor Applications

Authors: Elmira Moussavi, Dominik Sisejkovic, Animesh Singh, Daniyar Kizatov, Rainer Leupers, Sven Ingebrandt, Vivek Pachauri, Farhad Merchant

Abstract: The ion-sensitive field-effect transistor (ISFET) is an emerging technology that has received much attention in numerous research areas, including biochemistry, medicine, and security applications. However, compared to other types of sensors, the complexity of ISFETs makes it more challenging to achieve a sensitive, fast, and repeatable response. Therefore, various readout circuits have been developed to improve the performance of ISFETs, especially to eliminate the temperature effect. This paper presents a new approach for a temperature-independent readout circuit that uses the threshold voltage differences of an ISFET-MOSFET pair. The Linear Technology Simulation Program with Integrated Circuit Emphasis (LTspice) is used to analyze the ISFET performance based on the proposed readout circuit characteristics. A macro-model is used to model ISFET behavior, including the first-level Spice model for the MOSFET part and Verilog-A to model the surface potential, reference electrode, and electrolyte of the ISFET to determine the relationships between variables. In this way, the behavior of the ISFET is monitored by the output voltage of the readout circuit based on a change in the electrolyte's hydrogen potential (pH), determined by the simulation. The proposed readout circuit has a temperature coefficient of 11.9 ppm/°C for a temperature range of 0-100 °C and pH between 1 and 13. The proposed ISFET readout circuit outperforms other designs in terms of simplicity and does not require an additional sensor.

Submitted 9 August, 2022; originally announced August 2022.

Comments: 4 pages, 6 figures, accepted at LATS 2022
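A figure such as 11.9 ppm/°C is a temperature coefficient (TC) of the readout output. A common way to compute it is the "box method" sketched below in Python. Whether the paper uses exactly this definition, in particular which voltage serves as the nominal value, is an assumption. The sample voltages and the function name are hypothetical.

```python
def temp_coefficient_ppm(v_out, t_min_c, t_max_c, v_nominal=None):
    """Box-method temperature coefficient in ppm/°C:

        TC = (V_max - V_min) / (V_nom * (T_max - T_min)) * 1e6

    v_out: output voltages sampled across [t_min_c, t_max_c].
    v_nominal: reference voltage; defaults to the mean of v_out (assumption).
    """
    if t_max_c <= t_min_c:
        raise ValueError("t_max_c must be greater than t_min_c")
    if v_nominal is None:
        v_nominal = sum(v_out) / len(v_out)
    return (max(v_out) - min(v_out)) / (v_nominal * (t_max_c - t_min_c)) * 1e6


# Hypothetical readout voltages at 0, 50 and 100 °C.
samples = [0.9994, 1.0000, 1.0006]
print(f"{temp_coefficient_ppm(samples, 0, 100):.1f} ppm/°C")  # prints "12.0 ppm/°C"
```

The box method captures only the total output spread over the temperature range, not the shape of the curve in between. This is why TC values are usually reported together with the range, here 0-100 °C.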

arXiv:2207.11036

NISTT: A Non-Intrusive SystemC-TLM 2.0 Tracing Tool

Authors: Nils Bosbach, Lukas Jünger, Jan Moritz Joseph, Rainer Leupers

Abstract: The increasing complexity of systems-on-a-chip requires the continuous development of electronic design automation tools. Nowadays, the simulation of systems-on-a-chip using virtual platforms is common. Virtual platforms enable hardware/software co-design to shorten the time to market, offer insights into the models, and allow debugging of the simulated hardware. Profiling tools are required to improve the usability of virtual platforms. During simulation, these tools capture data that are evaluated afterward. Those data can reveal information about the simulation itself and the software executed on the platform. This work presents the tracing tool NISTT that can profile SystemC-TLM-2.0-based virtual platforms. NISTT is implemented in a completely non-intrusive way. That means no changes in the simulation are needed, the source code of the simulation is not required, and the traced simulation does not need to contain debug symbols. The standardized SystemC application programming interface guarantees the compatibility of NISTT with other simulations. The strengths of NISTT are demonstrated in a case study. Here, NISTT is connected to a virtual platform and traces the boot process of Linux. After the simulation, the database created by NISTT is evaluated, and the results are visualized. Furthermore, the overhead of NISTT is quantified. It is shown that NISTT has only a minor influence on the overall simulation performance.

Submitted 22 July, 2022; originally announced July 2022.

Comments: PREPRINT - accepted by 30th IFIP/IEEE International Conference on Very Large Scale Integration 2022 (VLSI-SoC 2022)

arXiv:2207.10526

PA-PUF: A Novel Priority Arbiter PUF

Authors: Simranjeet Singh, Srinivasu Bodapati, Sachin Patkar, Rainer Leupers, Anupam Chattopadhyay, Farhad Merchant

Abstract: This paper proposes a 3-input arbiter-based novel physically unclonable function (PUF) design. Firstly, a 3-input priority arbiter is designed using a simple arbiter, two multiplexers (2:1), and an XOR logic gate. The priority arbiter has an equal probability of 0's and 1's at the output, which results in excellent uniformity (49.45%) while retrieving the PUF response. Secondly, a new PUF design based on priority arbiter PUF (PA-PUF) is presented. The PA-PUF design is evaluated for uniqueness, non-linearity, and uniformity against the standard tests. The proposed PA-PUF design is configurable in challenge-response pairs through an arbitrary number of feed-forward priority arbiters introduced to the design. We demonstrate, through extensive experiments, a reliability of 100% after applying error correction techniques and a uniqueness of 49.63%. Finally, the design is compared with the literature to evaluate its implementation efficiency, where it is found to be superior to the state of the art.

Submitted 21 July, 2022; originally announced July 2022.
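The uniformity and uniqueness percentages quoted above are usually defined as follows. Uniformity is the share of 1s in one device's response (ideal 50%). Uniqueness is the average pairwise Hamming distance between responses of different devices to the same challenges (ideal 50%). The Python sketch below uses these commonly used definitions. The paper's exact formulation may differ, and the toy responses are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions where two equal-length bit responses differ."""
    if len(a) != len(b):
        raise ValueError("responses must have equal length")
    return sum(x != y for x, y in zip(a, b))


def uniformity(response):
    """Percentage of 1s in one PUF response (ideal: 50%)."""
    return 100.0 * sum(response) / len(response)


def uniqueness(responses):
    """Average pairwise inter-device Hamming distance in percent (ideal: 50%).
    responses: one response per device, all to the same set of challenges."""
    if len(responses) < 2:
        raise ValueError("need responses from at least two devices")
    n = len(responses[0])
    pairs = list(combinations(responses, 2))
    return 100.0 * sum(hamming(a, b) for a, b in pairs) / (len(pairs) * n)


# Hypothetical 4-bit responses of three devices to the same four challenges.
devices = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
print(uniformity(devices[0]))         # prints 50.0
print(f"{uniqueness(devices):.2f}")   # prints 66.67
```

Reliability is measured analogously with the intra-device Hamming distance: the same device is queried repeatedly, and the result is reported as 100% minus the average distance.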

arXiv:2206.11613

EmuNoC: Hybrid Emulation for Fast and Flexible Network-on-Chip Prototyping on FPGAs

Authors: Yee Yang Tan, Felix Staudigl, Lukas Jünger, Anna Drewes, Rainer Leupers, Jan Moritz Joseph

Abstract: Networks-on-Chips (NoCs) recently became widely used, from multi-core CPUs to edge-AI accelerators. Emulation on FPGAs promises to accelerate their RTL modeling compared to slow simulations. However, realistic test stimuli are challenging to generate in hardware for diverse applications. In other words, a design framework that is both fast and flexible is required. The most promising solution is hybrid emulation, in which parts of the design are simulated in software, and the other parts are emulated in hardware. This paper proposes a novel hybrid emulation framework called EmuNoC. We introduce a clock-synchronization method and software-only packet generation that improve the emulation speed by 36.3x to 79.3x over state-of-the-art frameworks while retaining the flexibility of a pure-software interface for stimuli simulation. We also increased the area efficiency, enabling an NoC with up to 169 routers to be modeled on a single FPGA, while previous frameworks only achieved 64 routers.

Submitted 23 June, 2022; originally announced June 2022.

arXiv:2204.01501

X-Fault: Impact of Faults on Binary Neural Networks in Memristor-Crossbar Arrays with Logic-in-Memory Computation

Authors: Felix Staudigl, Karl J. X. Sturm, Maximilian Bartel, Thorben Fetz, Dominik Sisejkovic, Jan Moritz Joseph, Leticia Bolzani Pöhls, Rainer Leupers

Abstract: Memristor-based crossbar arrays represent a promising emerging memory technology to replace conventional memories by offering a high density and enabling computing-in-memory (CIM) paradigms. While analog computing provides the best performance, non-idealities and ADC/DAC conversion limit memristor-based CIM. Logic-in-Memory (LIM) presents another flavor of CIM, in which the memristors are used in a binary manner to implement logic gates. Since binary neural networks (BNNs) use binary logic gates as the dominant operation, they can benefit from the massively parallel execution of binary operations and better resilience to variations of the memristors. Although conventional neural networks have been thoroughly investigated, the impact of faults on memristor-based BNNs remains unclear. Therefore, we analyze the impact of faults on logic gates in memristor-based crossbar arrays for BNNs. We propose a simulation framework that simulates different traditional faults to examine the accuracy loss of BNNs on memristive crossbar arrays. In addition, we compare different logic families based on their robustness and their feasibility for accelerating AI applications.

Submitted 4 April, 2022; originally announced April 2022.
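BNNs can run on binary logic gates because of a standard identity. If weights and activations in {-1, +1} are encoded as bits (+1 as 1, -1 as 0), a dot product reduces to an XNOR followed by a population count: dot = 2 * popcount(XNOR(a, w)) - n. The Python sketch below shows only this arithmetic in software. It does not model how X-Fault or any LIM logic family maps the gates onto memristors, and the bit encoding and function names are assumptions.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int: element i -> bit i, +1 -> 1."""
    return sum(1 << i for i, v in enumerate(values) if v == 1)


def xnor_popcount_dot(a_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors given as packed ints.

    XNOR marks positions where the signs agree (product +1); each agreement
    contributes +1 and each disagreement -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = ~(a_bits ^ w_bits) & mask  # XNOR restricted to the n valid bits
    return 2 * bin(matches).count("1") - n


a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
print(xnor_popcount_dot(pack(a), pack(w), len(a)))  # prints 0
print(sum(x * y for x, y in zip(a, w)))              # prints 0 (reference)
```

A fault that flips one stored bit changes one XNOR result and therefore shifts the dot product by 2. Accumulated across layers, such shifts produce the accuracy loss that fault-injection studies measure.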

arXiv:2203.05399

Designing ML-Resilient Locking at Register-Transfer Level

Authors: Dominik Sisejkovic, Luca Collini, Benjamin Tan, Christian Pilato, Ramesh Karri, Rainer Leupers

Abstract: Various logic-locking schemes have been proposed to protect hardware from intellectual property piracy and malicious design modifications. Since traditional locking techniques are applied on the gate-level netlist after logic synthesis, they have no semantic knowledge of the design function. Data-driven, machine-learning (ML) attacks can uncover the design flaws within gate-level locking. Recent proposals on register-transfer level (RTL) locking have access to semantic hardware information. We investigate the resilience of ASSURE, a state-of-the-art RTL locking method, against ML attacks. We used the lessons learned to derive two ML-resilient RTL locking schemes built to reinforce ASSURE locking. We developed ML-driven security metrics to evaluate the schemes against an RTL adaptation of the state-of-the-art, ML-based SnapShot attack.

Submitted 6 April, 2022; v1 submitted 10 March, 2022; originally announced March 2022.

Comments: Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC '22)

arXiv:2202.12085

pHGen: A pH-Based Key Generation Mechanism Using ISFETs

Authors: Elmira Moussavi, Dominik Sisejkovic, Fabian Brings, Daniyar Kizatov, Animesh Singh, Xuan Thang Vu, Sven Ingebrandt, Rainer Leupers, Vivek Pachauri, Farhad Merchant

Abstract: Digital keys are a fundamental component of many hardware- and software-based security mechanisms. However, digital keys are limited to binary values and easily exploitable when stored in standard memories. In this paper, based on emerging technologies, we introduce pHGen, a potential-of-hydrogen (pH)-based key generation mechanism that leverages chemical reactions in the form of a potential change in ion-sensitive field-effect transistors (ISFETs). The threshold voltage of ISFETs is manipulated according to a known pH buffer solution (key) in which the transistors are immersed. To read the chemical information effectively via ISFETs, we designed a readout circuit for stable operation and detection of voltage thresholds. To demonstrate the applicability of the proposed key generation, we utilize pHGen for logic locking, a hardware integrity protection scheme. The proposed key-generation method breaks the limits of binary values and provides the first steps toward the utilization of multi-valued voltage thresholds of ISFETs controlled by chemical information. The pHGen approach is expected to be a turning point for using more sophisticated bio-based analog keys for securing next-generation electronics.

Submitted 24 February, 2022; originally announced February 2022.

Comments: Accepted in HOST 2022

arXiv:2112.13157

A Parallel SystemC Virtual Platform for Neuromorphic Architectures

Authors: Melvin Galicia, Farhad Merchant, Rainer Leupers

Abstract: With the increasing interest in neuromorphic computing, designers of embedded systems face the challenge of efficiently simulating such platforms to enable architecture design exploration early in the development cycle. Executing artificial neural network applications on neuromorphic systems which are being simulated on virtual platforms (VPs) is an extremely demanding computational task. Nevertheless, it is a vital benchmarking task for comparing different possible architectures. Therefore, exploiting the multicore capabilities of the VP's host system is essential to achieve faster simulations. Hence, this paper presents a parallel SystemC-based VP for RISC-V multicore platforms integrating multiple computing-in-memory neuromorphic accelerators. Different VP segmentation architectures are explored for the integration of neuromorphic accelerators, and their simulation speedups relative to conventional sequential SystemC execution are reported.

Submitted 24 December, 2021; originally announced December 2021.

Comments: Accepted at 23rd International Symposium on Quality Electronic Design (ISQED'22)

arXiv:2112.01087

doi 10.23919/DATE54114.2022.9774651

NeuroHammer: Inducing Bit-Flips in Memristive Crossbar Memories

Authors: Felix Staudigl, Hazem Al Indari, Daniel Schön, Dominik Sisejkovic, Farhad Merchant, Jan Moritz Joseph, Vikas Rana, Stephan Menzel, Rainer Leupers

Abstract: Emerging non-volatile memory (NVM) technologies offer unique advantages in energy efficiency, latency, and features such as computing-in-memory. Consequently, emerging NVM technologies are considered an ideal substrate for computation and storage in future-generation neuromorphic platforms. These technologies need to be evaluated for fundamental reliability and security issues. In this paper, we present NeuroHammer, a security threat in ReRAM crossbars caused by thermal crosstalk between memory cells. We demonstrate that bit-flips can be deliberately induced in ReRAM devices in a crossbar by systematically writing adjacent memory cells. A simulation flow is developed to evaluate NeuroHammer and the impact of physical parameters on the effectiveness of the attack. Finally, we discuss the security implications in the context of possible attack scenarios.

Submitted 6 December, 2021; v1 submitted 2 December, 2021; originally announced December 2021.

arXiv:2109.02379

doi 10.1109/ICCD53106.2021.00097

QFlow: Quantitative Information Flow for Security-Aware Hardware Design in Verilog

Authors: Lennart M. Reimann, Luca Hanel, Dominik Sisejkovic, Farhad Merchant, Rainer Leupers

Abstract: The enormous amount of code required to design modern hardware implementations often leads to critical vulnerabilities being overlooked. Especially vulnerabilities that compromise the confidentiality of sensitive data, such as cryptographic keys, have a major impact on the trustworthiness of an entire system. Information flow analysis can determine whether information from sensitive signals flows towards outputs or untrusted components of the system. But most of these analytical strategies rely on the non-interference property, stating that the untrusted targets must not be influenced by the source's data, which has been shown to be too inflexible for many applications. To address this issue, there are approaches to quantify the information flow between components such that insignificant leakage can be neglected. Due to the high computational complexity of this quantification, approximations are needed, which introduce mispredictions. To tackle those limitations, we reformulate the approximations. Further, we propose a tool QFlow with a higher detection rate than previous tools. It can be used by inexperienced users to identify data leakages in hardware designs, thus facilitating a security-aware design process.

Submitted 22 December, 2021; v1 submitted 6 September, 2021; originally announced September 2021.

Comments: 5 pages, accepted at International Conference on Computer Design 2021 (ICCD)

Journal ref: 2021 IEEE 39th International Conference on Computer Design (ICCD)
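Non-interference gives a yes/no answer, whereas a quantitative view asks how many bits about a secret reach an output. As a minimal sketch, assuming a deterministic function of a uniformly distributed secret, the Shannon leakage I(K; f(K)) equals the entropy of the output, because H(f(K) | K) = 0. The helper `leaked_bits` below is hypothetical and is not part of QFlow or its metric; it computes the leakage by exhaustive enumeration, which is only feasible for a handful of secret bits. That exponential cost is why practical tools rely on approximations.

```python
import math
from collections import Counter


def leaked_bits(f, secret_bits: int) -> float:
    """Shannon leakage I(K; f(K)) in bits for deterministic f and a uniform secret K.

    For deterministic f, H(f(K) | K) = 0, so I(K; f(K)) = H(f(K)).
    Enumerates all 2**secret_bits secrets, so it only scales to tiny widths.
    """
    n = 1 << secret_bits
    counts = Counter(f(k) for k in range(n))  # output value -> number of secrets
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For an 8-bit key, exposing the key's lowest bit (`lambda k: k & 1`) leaks exactly 1 bit, while an equality check against one fixed value (`lambda k: int(k == 0x5A)`) leaks only about 0.037 bits, even though it violates non-interference; this is the kind of insignificant leakage a quantitative analysis can choose to neglect.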

arXiv:2107.08695

doi 10.1109/TCAD.2021.3100275

Deceptive Logic Locking for Hardware Integrity Protection against Machine Learning Attacks

Authors: Dominik Sisejkovic, Farhad Merchant, Lennart M. Reimann, Rainer Leupers

Abstract: Logic locking has emerged as a prominent key-driven technique to protect the integrity of integrated circuits. However, novel machine-learning-based attacks have recently been introduced to challenge the security foundations of locking schemes. These attacks are able to recover a significant percentage of the key without having access to an activated circuit. This paper addresses this issue through two focal points. First, we present a theoretical model to test locking schemes for key-related structural leakage that can be exploited by machine learning. Second, based on the theoretical model, we introduce D-MUX: a deceptive multiplexer-based logic-locking scheme that is resilient against structure-exploiting machine learning attacks. Through the design of D-MUX, we uncover a major fallacy in existing multiplexer-based locking schemes in the form of a structural-analysis attack. Finally, an extensive cost evaluation of D-MUX is presented. To the best of our knowledge, D-MUX is the first machine-learning-resilient locking scheme capable of protecting against all known learning-based attacks. Thus, the presented work offers a starting point for the design and evaluation of future-generation logic locking in the era of machine learning.

Submitted 19 July, 2021; originally announced July 2021.

Comments: Accepted at IEEE TCAD 2021

Journal ref: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), July, 2021
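In multiplexer-based locking, a key bit drives a 2:1 MUX that selects either the true signal or a decoy, so only the correct key restores the original function. The toy Python example below illustrates this basic idea on a three-input function. The circuit, the decoy wire (`a | b`), and the correct key value are all hypothetical, and the sketch does not implement D-MUX itself, whose purpose according to the abstract is to avoid key-related structural leakage.

```python
from itertools import product


def mux(sel: int, in0: int, in1: int) -> int:
    """2:1 multiplexer: returns in0 when sel == 0, else in1."""
    return in1 if sel else in0


def original(a: int, b: int, c: int) -> int:
    """Unlocked reference function (hypothetical example circuit)."""
    return (a & b) | c


def locked(a: int, b: int, c: int, key: int) -> int:
    """Locked version: the key bit selects the true wire (a AND b) or a decoy (a OR b).

    In this toy example key = 0 is the correct key.
    """
    w = mux(key, a & b, a | b)
    return w | c


def correct_keys() -> list[int]:
    """Keys for which the locked circuit matches the original on every input."""
    return [k for k in (0, 1)
            if all(locked(a, b, c, k) == original(a, b, c)
                   for a, b, c in product((0, 1), repeat=3))]
```

`correct_keys()` returns `[0]`: with key 1 the decoy `a | b` differs from `a & b` whenever exactly one of `a`, `b` is 1 and `c` is 0. A structure-exploiting attack tries to guess the key from the netlist alone, for instance from which MUX input looks like the "real" wire, without running this functional check against an activated circuit.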

arXiv:2107.01915

doi 10.1109/VLSI-SoC53125.2021.9606979

Logic Locking at the Frontiers of Machine Learning: A Survey on Developments and Opportunities

Authors: Dominik Sisejkovic, Lennart M. Reimann, Elmira Moussavi, Farhad Merchant, Rainer Leupers

Abstract: In the past decade, a lot of progress has been made in the design and evaluation of logic locking, a premier technique to safeguard the integrity of integrated circuits throughout the electronics supply chain. However, the widespread proliferation of machine learning has recently introduced a new pathway to evaluating logic locking schemes. This paper summarizes the recent developments in logic locking attacks and countermeasures at the frontiers of contemporary machine learning models. Based on the presented work, the key takeaways, opportunities, and challenges are highlighted to offer recommendations for the design of next-generation logic locking.

Submitted 23 November, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

Comments: 6 pages, 3 figures, accepted at VLSI-SOC 2021

Journal ref: 2021 IFIP/IEEE 29th International Conference on Very Large Scale Integration (VLSI-SoC)

arXiv:2101.06665

Brightening the Optical Flow through Posit Arithmetic

Authors: Vinay Saxena, Ankitha Reddy, Jonathan Neudorfer, John Gustafson, Sangeeth Nambiar, Rainer Leupers, Farhad Merchant

Abstract: As new technologies are invented, their commercial viability needs to be carefully examined along with their technical merits and demerits. The posit data format, proposed as a drop-in replacement for IEEE 754 float format, is one such invention that requires extensive theoretical and experimental study to identify products that can benefit from the advantages of posits for specific market segments. In this paper, we present an extensive empirical study of posit-based arithmetic vis-à-vis IEEE 754 compliant arithmetic for the optical flow estimation method called Lucas-Kanade (LuKa). First, we use SoftPosit and SoftFloat format emulators to perform an empirical error analysis of the LuKa method. Our study shows that the average error in LuKa with SoftPosit is an order of magnitude lower than LuKa with SoftFloat. We then present the integration of the hardware implementation of a posit adder and multiplier in a RISC-V open-source platform. We make several recommendations, along with the analysis of LuKa in the RISC-V context, for future generation platforms incorporating posit arithmetic units.

Submitted 17 January, 2021; originally announced January 2021.

Comments: To appear in ISQED 2021
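For readers comparing the two formats, a posit of width n with es exponent bits consists of a sign bit, a run-length-encoded regime that gives an integer k, up to es exponent bits e, and a fraction f with a hidden leading 1. Its value is (-1)^s * useed^k * 2^e * (1 + f) with useed = 2^(2^es), and negative bit patterns are decoded via their two's complement. The function below is a minimal reference decoder written for illustration; it is not the SoftPosit API and favors clarity over speed.

```python
import math


def decode_posit(bits: int, n: int, es: int) -> float:
    """Decode an n-bit posit with es exponent bits into a Python float.

    Value = (-1)^s * useed^k * 2^e * (1 + f), with useed = 2^(2^es);
    negative posits are decoded from their two's complement.
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return math.nan  # NaR ("not a real")
    negative = bool(bits >> (n - 1))
    if negative:
        bits = (-bits) & mask  # two's complement yields the magnitude pattern
    # Regime: run of identical bits starting right after the sign bit.
    pos = n - 2
    first = (bits >> pos) & 1
    run = 0
    while pos >= 0 and ((bits >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first else -run
    pos -= 1  # skip the terminating regime bit (harmless if the run filled the word)
    # Exponent: up to es bits; bits cut off by the end of the word count as zeros.
    e = 0
    for _ in range(es):
        e <<= 1
        if pos >= 0:
            e |= (bits >> pos) & 1
            pos -= 1
    # Fraction: all remaining bits, with an implicit leading 1.
    nfrac = max(pos + 1, 0)
    frac = bits & ((1 << nfrac) - 1)
    magnitude = math.ldexp(1.0 + frac / (1 << nfrac), k * (1 << es) + e)
    return -magnitude if negative else magnitude
```

In posit<8,0>, `0x40` decodes to 1.0, `0x50` to 1.5, and `0x7F` to 64.0, the largest representable value 2^(n-2) when es = 0, while `0x80` is NaR. The regime's variable length is what gives posits tapered precision: values near 1 keep more fraction bits than values near the extremes.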

arXiv:2101.05591

ANDROMEDA: An FPGA Based RISC-V MPSoC Exploration Framework

Authors: Farhad Merchant, Dominik Sisejkovic, Lennart M. Reimann, Kirthihan Yasotharan, Thomas Grass, Rainer Leupers

Abstract: With the growing demands of consumer electronic products, the computational requirements are increasing exponentially. Due to the applications' computational needs, computer architects are trying to pack as many cores as possible on a single die for accelerated execution of application code. In a multiprocessor system-on-chip (MPSoC), striking a balance among the number of cores, memory subsystems, and network-on-chip parameters is essential to attain the desired performance. In this paper, we present ANDROMEDA, a RISC-V based framework that allows us to explore the different configurations of an MPSoC and observe the performance penalties and gains. We emulate the various configurations of MPSoC on the Synopsys HAPS-80D Dual FPGA platform. Using STREAM, matrix multiply, and N-body simulations as benchmarks, we demonstrate our framework's efficacy in quickly identifying the right parameters for efficient execution of these benchmarks.

Submitted 14 January, 2021; originally announced January 2021.

Comments: Accepted in VLSI Design 2021

arXiv:2101.01416

An Investigation on Inherent Robustness of Posit Data Representation

Authors: Ihsen Alouani, Anouar Ben Khalifa, Farhad Merchant, Rainer Leupers

Abstract: As the dimensions and operating voltages of computer electronics shrink to cope with consumers' demand for higher performance and lower power consumption, circuit sensitivity to soft errors increases dramatically. Recently, a new data type called posit has been proposed in the literature. Posit arithmetic has absolute advantages such as higher numerical accuracy, speed, and simpler hardware design than IEEE 754-2008 technical standard-compliant arithmetic. In this paper, we propose a comparative robustness study between 32-bit posit and 32-bit IEEE 754-2008 compliant representations. First, we present a theoretical analysis of IEEE 754 compliant numbers and posit numbers for single and double bit flips. Then, we conduct exhaustive fault injection experiments that show a considerable inherent resilience in posit format compared to classical IEEE 754 compliant representation. To show a relevant use-case of fault-tolerant applications, we perform experiments on a set of machine-learning applications. In more than 95% of the exhaustive fault injection exploration, posit representation is less impacted by faults than the IEEE 754 compliant floating-point representation. Moreover, in 100% of the tested machine-learning applications, the accuracy of posit-implemented systems is higher than the classical floating-point-based ones.

Submitted 5 January, 2021; originally announced January 2021.

Comments: To appear in VLSID 2021
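Exhaustive single-bit fault injection amounts to flipping every bit position of an encoded value and observing the result. The sketch below does this for IEEE 754 binary32 using Python's standard `struct` module. It covers only the floating-point side, since the posit side would additionally need a posit encoder and decoder, and it is an illustration rather than the authors' fault-injection setup.

```python
import struct


def flip_bit_f32(x: float, bit: int) -> float:
    """Flip one bit of x's IEEE 754 binary32 encoding (x is first rounded to binary32).

    Bit 0 is the least significant fraction bit, bits 23-30 hold the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    (word,) = struct.unpack("<I", struct.pack("<f", x))
    (flipped,) = struct.unpack("<f", struct.pack("<I", word ^ (1 << bit)))
    return flipped


def single_flip_outcomes(x: float) -> list[float]:
    """Exhaustive single-bit fault injection: one faulty value per bit position."""
    return [flip_bit_f32(x, b) for b in range(32)]
```

Starting from 1.0, flipping the least significant fraction bit gives 1 + 2^-23, flipping bit 23 halves the value to 0.5, and flipping bit 30 (the most significant exponent bit) turns it into infinity. This is why exponent bits dominate the damage a single upset can do in binary32.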

arXiv:2012.12563

Architecture, Dataflow and Physical Design Implications of 3D-ICs for DNN-Accelerators

Authors: Jan Moritz Joseph, Ananda Samajdar, Lingjun Zhu, Rainer Leupers, Sung-Kyu Lim, Thilo Pionteck, Tushar Krishna

Abstract: The everlasting demand for higher computing power for deep neural networks (DNNs) drives the development of parallel computing architectures. 3D integration, in which chips are integrated and connected vertically, can further increase performance because it introduces another level of spatial parallelism. Therefore, we analyze dataflows, performance, area, power and temperature of such 3D-DNN-accelerators. Monolithic and TSV-based stacked 3D-ICs are compared against 2D-ICs. We identify workload properties and architectural parameters for efficient 3D-ICs and achieve up to 9.14x speedup of 3D vs. 2D. We discuss area-performance trade-offs. We demonstrate applicability as the 3D-IC draws power similar to 2D-ICs and is not thermally limited.

Submitted 18 February, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

arXiv:2011.10389

doi 10.1145/3431389

Challenging the Security of Logic Locking Schemes in the Era of Deep Learning: A Neuroevolutionary Approach

Authors: Dominik Sisejkovic, Farhad Merchant, Lennart M. Reimann, Harshit Srivastava, Ahmed Hallawa, Rainer Leupers

Abstract: Logic locking is a prominent technique to protect the integrity of hardware designs throughout the integrated circuit design and fabrication flow. However, in recent years, the security of locking schemes has been thoroughly challenged by the introduction of various deobfuscation attacks. As in most research branches, deep learning is being introduced in the domain of logic locking as well. Therefore, in this paper we present SnapShot: a novel attack on logic locking that is the first of its kind to utilize artificial neural networks to directly predict a key bit value from a locked synthesized gate-level netlist without using a golden reference. The attack uses a simpler yet more flexible learning model compared to existing work. Two different approaches are evaluated. The first approach is based on a simple feedforward fully connected neural network. The second approach utilizes genetic algorithms to evolve more complex convolutional neural network architectures specialized for the given task. The attack flow offers a generic and customizable framework for attacking locking schemes using machine learning techniques. We perform an extensive evaluation of SnapShot for two realistic attack scenarios, comprising both reference benchmark circuits and silicon-proven RISC-V core modules. The evaluation results show that SnapShot achieves an average key prediction accuracy of 82.60% for the selected attack scenario, with a significant performance increase of 10.49 percentage points compared to the state of the art. Moreover, SnapShot outperforms the existing technique on all evaluated benchmarks. The results indicate that the security foundation of common logic locking schemes is built on questionable assumptions. The conclusions of the evaluation offer insights into the challenges of designing future logic locking schemes that are resilient to machine learning attacks.

Submitted 30 November, 2020; v1 submitted 20 November, 2020; originally announced November 2020.

Comments: 25 pages, 17 figures, accepted at ACM JETC

Journal ref: ACM J. Emerg. Technol. Comput. Syst. 17, 3, Article 30 (May 2021), 26 pages

arXiv:2006.12274

Dataflow Aware Mapping of Convolutional Neural Networks Onto Many-Core Platforms With Network-on-Chip Interconnect

Authors: Andreas Bytyn, René Ahlsdorf, Rainer Leupers, Gerd Ascheid

Abstract: Machine intelligence, especially using convolutional neural networks (CNNs), has become a large area of research over the past years. Increasingly sophisticated hardware accelerators are proposed that exploit, e.g., the sparsity in computations and make use of reduced precision arithmetic to scale down the energy consumption. However, future platforms require more than just energy efficiency: Scalability is becoming an increasingly important factor. The required effort for physical implementation grows with the size of the accelerator, making it more difficult to meet target constraints. Using many-core platforms consisting of several homogeneous cores can alleviate the aforementioned limitations with regard to physical implementation at the expense of an increased dataflow mapping effort. While the dataflow in CNNs is deterministic and can therefore be optimized offline, the problem of finding a suitable scheme that minimizes both runtime and off-chip memory accesses is a challenging task which becomes even more complex if an interconnect system is involved. This work presents an automated mapping strategy starting at the single-core level with different optimization targets for minimal runtime and minimal off-chip memory accesses. The strategy is then extended towards a suitable many-core mapping scheme and evaluated using a scalable system-level simulation with a network-on-chip interconnect. Design space exploration is performed by mapping the well-known CNNs AlexNet and VGG-16 to platforms of different core counts and computational power per core in order to investigate the trade-offs. Our mapping strategy and system setup are scaled starting from the single core level up to 128 cores, thereby showing the limits of the selected approach.

Submitted 18 June, 2020; originally announced June 2020.

arXiv:2006.00364

CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism

Authors: Niraj Sharma, Riya Jain, Madhumita Mohan, Sachin Patkar, Rainer Leupers, Nikhil Rishiyur, Farhad Merchant

Abstract: Many engineering and scientific applications require high precision arithmetic. IEEE 754-2008 compliant (floating-point) arithmetic is the de facto standard for performing these computations. Recently, posit arithmetic has been proposed as a drop-in replacement for floating-point arithmetic. The posit™ data representation and arithmetic claim several absolute advantages over the floating-point format and arithmetic, including higher dynamic range, better accuracy, and superior performance-area trade-offs. However, there does not exist any accessible, holistic framework that facilitates the validation of these claims of posit arithmetic, especially when the claims involve long accumulations (quire). In this paper, we present a consolidated general-purpose processor-based framework to support posit arithmetic empiricism. The end-users of the framework have the liberty to seamlessly experiment with their applications using posit and floating-point arithmetic since the framework is designed for the two number systems to coexist. Melodica is a posit arithmetic core that implements parametric fused operations that uniquely involve the quire data type. Clarinet is a Melodica-enabled processor based on the RISC-V ISA. To the best of our knowledge, this is the first-ever integration of quire with a RISC-V core. To show the effectiveness of the Clarinet platform, we perform an extensive application study and benchmark some of the common linear algebra and computer vision kernels. We emulate Clarinet on a Xilinx FPGA and present utilization and timing data. Clarinet and Melodica remain under active development and are available as open source for posit arithmetic empiricism.

Submitted 27 October, 2021; v1 submitted 30 May, 2020; originally announced June 2020.
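The quire is a wide fixed-point accumulator that lets a posit unit sum many products exactly and round only once at the end, which matters most for the long accumulations mentioned above. The Python sketch below illustrates that effect with ordinary binary64 floats, using `fractions.Fraction` as a stand-in for the quire. This is an assumption for illustration: no posit arithmetic or fixed quire width is modeled, and it is not Melodica's implementation.

```python
from fractions import Fraction


def naive_dot(xs, ys):
    """Dot product that rounds after every multiply and every add."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def quire_style_dot(xs, ys):
    """Dot product accumulated exactly, then rounded once (the quire idea).

    Fraction stands in for the quire's wide fixed-point register: products of
    binary floats are exact rationals, so nothing is lost until the final
    conversion back to float.
    """
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)
    return float(acc)
```

For `xs = [1e16, 1.0, -1e16]` and `ys = [1.0, 1.0, 1.0]`, `naive_dot` returns 0.0 because 1e16 + 1.0 rounds back to 1e16, whereas `quire_style_dot` returns the exact 1.0.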

arXiv:1904.05106

doi 10.1109/ISCAS.2019.8702357

An Application-Specific VLIW Processor with Vector Instruction Set for CNN Acceleration

Authors: Andreas Bytyn, Rainer Leupers, Gerd Ascheid

Abstract: In recent years, neural networks have surpassed classical algorithms in areas such as object recognition, e.g. in the well-known ImageNet challenge. As a result, great effort is being put into developing fast and efficient accelerators, especially for Convolutional Neural Networks (CNNs). In this work we present ConvAix, a fully C-programmable processor, which, contrary to many existing architectures, does not rely on a hard-wired array of multiply-and-accumulate (MAC) units. Instead, it maps computations onto independent vector lanes making use of a carefully designed vector instruction set. The presented processor is targeted towards latency-sensitive applications and is capable of executing up to 192 MAC operations per cycle. ConvAix operates at a target clock frequency of 400 MHz in 28 nm CMOS, thereby offering state-of-the-art performance with proper flexibility within its target domain. Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16-bit fixed-point arithmetic. Compared to other well-known designs which are less flexible, ConvAix offers competitive energy efficiency of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and processing speed.

Submitted 10 April, 2019; originally announced April 2019.

Comments: Accepted for publication in the proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS)
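The peak figures quoted above can be combined into a throughput estimate. Assuming the common convention that one MAC counts as two operations (a multiply and an add), 192 MACs per cycle at the 400 MHz target clock give:

```latex
\begin{aligned}
R_{\text{MAC}} &= 192\,\tfrac{\text{MAC}}{\text{cycle}} \times 400\times10^{6}\,\tfrac{\text{cycle}}{\text{s}}
                = 76.8\times10^{9}\,\tfrac{\text{MAC}}{\text{s}},\\
R_{\text{op}}  &= 2\,\tfrac{\text{op}}{\text{MAC}} \times R_{\text{MAC}} = 153.6\ \text{GOP/s}.
\end{aligned}
```

This is a theoretical peak; the reported 72.5% average ALU utilization suggests that sustained throughput on real layers is lower.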

arXiv:1803.05320

Efficient Realization of Givens Rotation through Algorithm-Architecture Co-design for Acceleration of QR Factorization

Authors: Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S K Nandy, Ranjani Narayan, Rainer Leupers

Abstract: We present an efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicore processors and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over the classical Givens Rotation (GR) operation that can annihilate multiple elements of rows and columns of an input matrix simultaneously. GGR takes 33% fewer multiplications than GR. For a custom implementation of GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to the pipeline of a Processing Element (PE). In the PE, GGR attains a speed-up of 1.1x over the Modified Householder Transform (MHT) presented in the literature. For parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of Gflops/watt, which is counter-intuitive.

Submitted 23 March, 2018; v1 submitted 14 March, 2018; originally announced March 2018.
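A classical Givens rotation zeroes one matrix entry at a time: for a pair (a, b) it uses c = a/r and s = b/r with r = sqrt(a^2 + b^2), so that the rotated pair becomes (r, 0). The Python sketch below reduces a matrix to the upper-triangular factor R this way, one element per rotation, without forming Q. It shows only the classical GR baseline, not GGR, which according to the abstract annihilates several elements at once with 33% fewer multiplications.

```python
import math


def givens(a: float, b: float) -> tuple[float, float, float]:
    """Return (c, s, r) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0, a  # nothing to annihilate: identity rotation
    r = math.hypot(a, b)
    return a / r, b / r, r


def qr_givens_r(A: list[list[float]]) -> list[list[float]]:
    """Reduce A (m x n) to upper-triangular R using classical Givens rotations."""
    R = [row[:] for row in A]  # work on a copy
    m, n = len(R), len(R[0])
    for j in range(n):
        # Annihilate the entries below the diagonal bottom-up, one per rotation,
        # rotating each row i against the row directly above it.
        for i in range(m - 1, j, -1):
            c, s, _ = givens(R[i - 1][j], R[i][j])
            for k in range(n):
                upper, lower = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * upper + s * lower
                R[i][k] = -s * upper + c * lower
    return R
```

For `[[3.0, 1.0], [4.0, 2.0]]`, one rotation with c = 0.6 and s = 0.8 zeroes the entry 4.0 and produces 5.0 on the diagonal. Because every rotation is orthogonal, the Frobenius norm of the matrix is preserved, which is a convenient sanity check.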

arXiv:1305.1459

doi 10.12837/2013T01

EURETILE 2010-2012 summary: first three years of activity of the European Reference Tiled Experiment

Authors: Pier Stanislao Paolucci, Iuliana Bacivarov, Gert Goossens, Rainer Leupers, Frédéric Rousseau, Christoph Schumacher, Lothar Thiele, Piero Vicini

Abstract: This is the summary of the first three years of activity of the EURETILE FP7 project 247846. EURETILE investigates and implements brain-inspired and fault-tolerant foundational innovations to the system architecture of massively parallel tiled computer architectures and the corresponding programming paradigm. The execution targets are a many-tile HW platform and a many-tile simulator. A set of SW-process-to-HW-tile mapping candidates is generated by the holistic SW tool-chain using a combination of analytic and bio-inspired methods. The Hardware dependent Software is then generated, providing OS services with maximum efficiency/minimal overhead. The many-tile simulator collects profiling data, closing the loop of the SW tool chain. Fine-grain parallelism inside processes is exploited by optimized intra-tile compilation techniques, but the project focus is above the level of the elementary tile. The elementary HW tile is a multi-processor, which includes a fault tolerant Distributed Network Processor (for inter-tile communication) and ASIP accelerators. Furthermore, EURETILE investigates and implements the innovations for equipping the elementary HW tile with high-bandwidth, low-latency brain-like inter-tile communication emulating 3 levels of connection hierarchy, namely neural columns, cortical areas and cortex, and develops a dedicated cortical simulation benchmark: DPSNN-STDP (Distributed Polychronous Spiking Neural Net with synaptic Spiking Time Dependent Plasticity). EURETILE leverages the multi-tile HW paradigm and SW tool-chain developed by the FET-ACA SHAPES Integrated Project (2006-2009).

Submitted 7 May, 2013; originally announced May 2013.

Comments: 56 pages

ACM Class: C.1.4; C.3; B.7.2; F.2.2

arXiv:0910.3427 [pdf, ps, other]

doi 10.1109/TCSII.2010.2056014

A Scalable VLSI Architecture for Soft-Input Soft-Output Depth-First Sphere Decoding

Authors: Ernst Martin Witte, Filippo Borlenghi, Gerd Ascheid, Rainer Leupers, Heinrich Meyr

Abstract: Multiple-input multiple-output (MIMO) wireless transmission imposes huge challenges on the design of efficient hardware architectures for iterative receivers. A major challenge is soft-input soft-output (SISO) MIMO demapping, often approached by sphere decoding (SD). In this paper, we introduce what is, to the best of our knowledge, the first VLSI architecture for SISO SD applying a single tree-search approach. Compared with a soft-output-only base architecture similar to the one proposed by Studer et al. in IEEE J-SAC 2008, the architectural modifications for soft input still allow a one-node-per-cycle execution. For a 4×4 16-QAM system, the area increases by only 57% and the operating frequency degrades by only 34%.

Submitted 7 June, 2010; v1 submitted 18 October, 2009; originally announced October 2009.

Comments: Accepted for IEEE Transactions on Circuits and Systems II: Express Briefs, May 2010. This draft from April 2010 will not be updated anymore. Please refer to IEEE Xplore for the final version. The final publication will appear with the modified title "A Scalable VLSI Architecture for Soft-Input Soft-Output Single Tree-Search Sphere Decoding".

Journal ref: IEEE Transactions on Circuits and Systems-Part II: Express Briefs, vol. 57, no. 9, pp. 706-710, Sep 2010

Showing 1–34 of 34 results for author: Leupers, R