-
Generative AI in the Construction Industry: A State-of-the-art Analysis
Authors:
Ridwan Taiwo,
Idris Temitope Bello,
Sulemana Fatoama Abdulai,
Abdul-Mugis Yussif,
Babatunde Abiodun Salami,
Abdullahi Saka,
Tarek Zayed
Abstract:
The construction industry is a vital sector of the global economy, but it faces many productivity challenges in various processes, such as design, planning, procurement, inspection, and maintenance. Generative artificial intelligence (AI), which can create novel and realistic data or content, such as text, image, video, or code, based on some input or prior knowledge, offers innovative and disrupt…
▽ More
The construction industry is a vital sector of the global economy, but it faces many productivity challenges in various processes, such as design, planning, procurement, inspection, and maintenance. Generative artificial intelligence (AI), which can create novel and realistic data or content, such as text, image, video, or code, based on some input or prior knowledge, offers innovative and disruptive solutions to address these challenges. However, there is a gap in the literature on the current state, opportunities, and challenges of generative AI in the construction industry. This study aims to fill this gap by providing a state-of-the-art analysis of generative AI in construction, with three objectives: (1) to review and categorize the existing and emerging generative AI opportunities and challenges in the construction industry; (2) to propose a framework for construction firms to build customized generative AI solutions using their own data, comprising steps such as data collection, dataset curation, training custom large language model (LLM), model evaluation, and deployment; and (3) to demonstrate the framework via a case study of develo** a generative model for querying contract documents. The results show that retrieval augmented generation (RAG) improves the baseline LLM by 5.2, 9.4, and 4.8% in terms of quality, relevance, and reproducibility. This study provides academics and construction professionals with a comprehensive analysis and practical framework to guide the adoption of generative AI techniques to enhance productivity, quality, safety, and sustainability across the construction industry.
△ Less
Submitted 15 February, 2024;
originally announced February 2024.
-
Makinote: An FPGA-Based HW/SW Platform for Pre-Silicon Emulation of RISC-V Designs
Authors:
Elias Perdomo,
Alexander Kropotov,
Francelly Cano,
Syed Zafar,
Teresa Cervero,
Xavier Martorell,
Behzad Salami
Abstract:
Emulating chip functionality before silicon production is crucial, especially with the increasing prevalence of RISC-V-based designs. FPGAs are promising candidates for such purposes due to their high-speed and reconfigurable architecture. In this paper, we introduce our Makinote, an FPGA-based Cluster platform, hosted at Barcelona Supercomputing Center (BSC-CNS), which is composed of a large numb…
▽ More
Emulating chip functionality before silicon production is crucial, especially with the increasing prevalence of RISC-V-based designs. FPGAs are promising candidates for such purposes due to their high-speed and reconfigurable architecture. In this paper, we introduce our Makinote, an FPGA-based Cluster platform, hosted at Barcelona Supercomputing Center (BSC-CNS), which is composed of a large number of FPGAs (in total 96 AMD/Xilinx Alveo U55c) to emulate massive size RTL designs (up to 750M ASIC cells). In addition, we introduce our FPGA shell as a powerful tool to facilitate the utilization of such a large FPGA cluster with minimal effort needed by the designers. The proposed FPGA shell provides an easy-to-use interface for the RTL developers to rapidly port such design into several FPGAs by automatically connecting to the necessary ports, e.g., PCIe Gen4, DRAM (DDR4 and HBM), ETH10g/100g. Moreover, specific drivers for exploiting RISC-V based architectures are provided within the set of tools associated with the FPGA shell. We release the tool online for further extensions.
We validate the efficiency of our hardware platform (i.e., FPGA cluster) and the software tool (i.e., FPGA Shell) by emulating a RISC-V processor and experimenting HPC Challenge application running on 32 FPGAs. Our results demonstrate that the performance improves by 8 times over the single-FPGA case.
△ Less
Submitted 5 February, 2024; v1 submitted 31 January, 2024;
originally announced January 2024.
-
Read Disturbance in High Bandwidth Memory: A Detailed Experimental Study on HBM2 DRAM Chips
Authors:
Ataberk Olgun,
Majd Osseiran,
Abdullah Giray Yaglikci,
Yahya Can Tugrul,
Haocong Luo,
Steve Rhyner,
Behzad Salami,
Juan Gomez Luna,
Onur Mutlu
Abstract:
We experimentally demonstrate the effects of read disturbance (RowHammer and RowPress) and uncover the inner workings of undocumented read disturbance defense mechanisms in High Bandwidth Memory (HBM). Detailed characterization of six real HBM2 DRAM chips in two different FPGA boards shows that (1) the read disturbance vulnerability significantly varies between different HBM2 chips and between dif…
▽ More
We experimentally demonstrate the effects of read disturbance (RowHammer and RowPress) and uncover the inner workings of undocumented read disturbance defense mechanisms in High Bandwidth Memory (HBM). Detailed characterization of six real HBM2 DRAM chips in two different FPGA boards shows that (1) the read disturbance vulnerability significantly varies between different HBM2 chips and between different components (e.g., 3D-stacked channels) inside a chip, (2) DRAM rows at the end and in the middle of a bank are more resilient to read disturbance, (3) fewer additional activations are sufficient to induce more read disturbance bitflips in a DRAM row if the row exhibits the first bitflip at a relatively high activation count, (4) a modern HBM2 chip implements undocumented read disturbance defenses that track potential aggressor rows based on how many times they are activated. We describe how our findings could be leveraged to develop more powerful read disturbance attacks and more efficient defense mechanisms. We open source all our code and data to facilitate future research at https://github.com/CMU-SAFARI/HBM-Read-Disturbance.
△ Less
Submitted 2 May, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
GPT Models in Construction Industry: Opportunities, Limitations, and a Use Case Validation
Authors:
Abdullahi Saka,
Ridwan Taiwo,
Nurudeen Saka,
Babatunde Salami,
Saheed Ajayi,
Kabiru Akande,
Hadi Kazemi
Abstract:
Large Language Models(LLMs) trained on large data sets came into prominence in 2018 after Google introduced BERT. Subsequently, different LLMs such as GPT models from OpenAI have been released. These models perform well on diverse tasks and have been gaining widespread applications in fields such as business and education. However, little is known about the opportunities and challenges of using LL…
▽ More
Large Language Models(LLMs) trained on large data sets came into prominence in 2018 after Google introduced BERT. Subsequently, different LLMs such as GPT models from OpenAI have been released. These models perform well on diverse tasks and have been gaining widespread applications in fields such as business and education. However, little is known about the opportunities and challenges of using LLMs in the construction industry. Thus, this study aims to assess GPT models in the construction industry. A critical review, expert discussion and case study validation are employed to achieve the study objectives. The findings revealed opportunities for GPT models throughout the project lifecycle. The challenges of leveraging GPT models are highlighted and a use case prototype is developed for materials selection and optimization. The findings of the study would be of benefit to researchers, practitioners and stakeholders, as it presents research vistas for LLMs in the construction industry.
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
An Experimental Analysis of RowHammer in HBM2 DRAM Chips
Authors:
Ataberk Olgun,
Majd Osseiran,
Abdullah Giray Ya{ğ}lık{c}ı,
Yahya Can Tuğrul,
Haocong Luo,
Steve Rhyner,
Behzad Salami,
Juan Gomez Luna,
Onur Mutlu
Abstract:
RowHammer (RH) is a significant and worsening security, safety, and reliability issue of modern DRAM chips that can be exploited to break memory isolation. Therefore, it is important to understand real DRAM chips' RH characteristics. Unfortunately, no prior work extensively studies the RH vulnerability of modern 3D-stacked high-bandwidth memory (HBM) chips, which are commonly used in modern GPUs.…
▽ More
RowHammer (RH) is a significant and worsening security, safety, and reliability issue of modern DRAM chips that can be exploited to break memory isolation. Therefore, it is important to understand real DRAM chips' RH characteristics. Unfortunately, no prior work extensively studies the RH vulnerability of modern 3D-stacked high-bandwidth memory (HBM) chips, which are commonly used in modern GPUs.
In this work, we experimentally characterize the RH vulnerability of a real HBM2 DRAM chip. We show that 1) different 3D-stacked channels of HBM2 memory exhibit significantly different levels of RH vulnerability (up to 79% difference in bit error rate), 2) the DRAM rows at the end of a DRAM bank (rows with the highest addresses) exhibit significantly fewer RH bitflips than other rows, and 3) a modern HBM2 DRAM chip implements undisclosed RH defenses that are triggered by periodic refresh operations. We describe the implications of our observations on future RH attacks and defenses and discuss future work for understanding RH in 3D-stacked memories.
△ Less
Submitted 29 May, 2023;
originally announced May 2023.
-
TuRaN: True Random Number Generation Using Supply Voltage Underscaling in SRAMs
Authors:
İsmail Emir Yüksel,
Ataberk Olgun,
Behzad Salami,
F. Nisa Bostancı,
Yahya Can Tuğrul,
A. Giray Yağlıkçı,
Nika Mansouri Ghiasi,
Onur Mutlu,
Oğuz Ergin
Abstract:
Prior works propose SRAM-based TRNGs that extract entropy from SRAM arrays. SRAM arrays are widely used in a majority of specialized or general-purpose chips that perform the computation to store data inside the chip. Thus, SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs. However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high energy consumption,…
▽ More
Prior works propose SRAM-based TRNGs that extract entropy from SRAM arrays. SRAM arrays are widely used in a majority of specialized or general-purpose chips that perform the computation to store data inside the chip. Thus, SRAM-based TRNGs present a low-cost alternative to dedicated hardware TRNGs. However, existing SRAM-based TRNGs suffer from 1) low TRNG throughput, 2) high energy consumption, 3) high TRNG latency, and 4) the inability to generate true random numbers continuously, which limits the application space of SRAM-based TRNGs. Our goal in this paper is to design an SRAM-based TRNG that overcomes these four key limitations and thus, extends the application space of SRAM-based TRNGs. To this end, we propose TuRaN, a new high-throughput, energy-efficient, and low-latency SRAM-based TRNG that can sustain continuous operation. TuRaN leverages the key observation that accessing SRAM cells results in random access failures when the supply voltage is reduced below the manufacturer-recommended supply voltage. TuRaN generates random numbers at high throughput by repeatedly accessing SRAM cells with reduced supply voltage and post-processing the resulting random faults using the SHA-256 hash function. To demonstrate the feasibility of TuRaN, we conduct SPICE simulations on different process nodes and analyze the potential of access failure for use as an entropy source. We verify and support our simulation results by conducting real-world experiments on two commercial off-the-shelf FPGA boards. We evaluate the quality of the random numbers generated by TuRaN using the widely-adopted NIST standard randomness tests and observe that TuRaN passes all tests. TuRaN generates true random numbers with (i) an average (maximum) throughput of 1.6Gbps (1.812Gbps), (ii) 0.11nJ/bit energy consumption, and (iii) 278.46us latency.
△ Less
Submitted 20 November, 2022;
originally announced November 2022.
-
NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators
Authors:
Aditya Manglik,
Minesh Patel,
Haiyu Mao,
Behzad Salami,
Jisung Park,
Lois Orosa,
Onur Mutlu
Abstract:
Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide nati…
▽ More
Resistive Random-Access Memory (RRAM) is well-suited to accelerate neural network (NN) workloads as RRAM-based Processing-in-Memory (PIM) architectures natively support highly-parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide natively. Consequently, state-of-the-art works either integrate additional digital logic circuits to support the non-MAC operations or offload the non-MAC operations to CPU/GPU, resulting in significant performance and energy efficiency overheads due to data movement.
In this work, we propose NEON, a novel compiler optimization to enable the end-to-end execution of the NN workload in RRAM. The key idea of NEON is to transform each non-MAC operation into a lightweight yet highly-accurate neural network. Utilizing neural networks to approximate the non-MAC operations provides two advantages: 1) We can exploit the key strength of RRAM, i.e., highly-parallel MAC operation, to flexibly and efficiently execute non-MAC operations in memory. 2) We can simplify RRAM's microarchitecture by eliminating the additional digital logic circuits while reducing the data movement overheads. Acceleration of the non-MAC operations in memory enables NEON to achieve a 2.28x speedup compared to an idealized digital logic-based RRAM. We analyze the trade-offs associated with the transformation and demonstrate feasible use cases for NEON across different substrates.
△ Less
Submitted 10 November, 2022;
originally announced November 2022.
-
PiDRAM: An FPGA-based Framework for End-to-end Evaluation of Processing-in-DRAM Techniques
Authors:
Ataberk Olgun,
Juan Gomez Luna,
Konstantinos Kanellopoulos,
Behzad Salami,
Hasan Hassan,
Oguz Ergin,
Onur Mutlu
Abstract:
DRAM-based main memory is used in nearly all computing systems as a major component. One way of overcoming the main memory bottleneck is to move computation near memory, a paradigm known as processing-in-memory (PiM). Recent PiM techniques provide a promising way to improve the performance and energy efficiency of existing and future systems at no additional DRAM hardware cost.
We develop the Pr…
▽ More
DRAM-based main memory is used in nearly all computing systems as a major component. One way of overcoming the main memory bottleneck is to move computation near memory, a paradigm known as processing-in-memory (PiM). Recent PiM techniques provide a promising way to improve the performance and energy efficiency of existing and future systems at no additional DRAM hardware cost.
We develop the Processing-in-DRAM (PiDRAM) framework, the first flexible, end-to-end, and open source framework that enables system integration studies and evaluation of real PiM techniques using real DRAM chips. We demonstrate a prototype of PiDRAM on an FPGA-based platform (Xilinx ZC706) that implements an open-source RISC-V system (Rocket Chip). To demonstrate the flexibility and ease of use of PiDRAM, we implement two PiM techniques: (1) RowClone, an in-DRAM copy and initialization mechanism (using command sequences proposed by ComputeDRAM), and (2) D-RaNGe, an in-DRAM true random number generator based on DRAM activation-latency failures.
Our end-to-end evaluation of RowClone shows up to 14.6X speedup for copy and 12.6X initialization operations over CPU copy (i.e., conventional memcpy) and initialization (i.e., conventional calloc) operations. Our implementation of D-RaNGe provides high throughput true random numbers, reaching 8.30 Mb/s throughput. Over the Verilog and C++ basis provided by PiDRAM, implementing the required hardware and software components, implementing RowClone end-to-end takes 198 (565) and implementing D-RaNGe end-to-end takes 190 (78) lines of Verilog (C++) code. PiDRAM is open sourced on Github: https://github.com/CMU-SAFARI/PiDRAM.
△ Less
Submitted 1 June, 2022;
originally announced June 2022.
-
PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM
Authors:
Ataberk Olgun,
Juan Gómez Luna,
Konstantinos Kanellopoulos,
Behzad Salami,
Hasan Hassan,
Oğuz Ergin,
Onur Mutlu
Abstract:
Processing-using-memory (PuM) techniques leverage the analog operation of memory cells to perform computation. Several recent works have demonstrated PuM techniques in off-the-shelf DRAM devices. Since DRAM is the dominant memory technology as main memory in current computing systems, these PuM techniques represent an opportunity for alleviating the data movement bottleneck at very low cost. Howev…
▽ More
Processing-using-memory (PuM) techniques leverage the analog operation of memory cells to perform computation. Several recent works have demonstrated PuM techniques in off-the-shelf DRAM devices. Since DRAM is the dominant memory technology as main memory in current computing systems, these PuM techniques represent an opportunity for alleviating the data movement bottleneck at very low cost. However, system integration of PuM techniques imposes non-trivial challenges that are yet to be solved. Design space exploration of potential solutions to the PuM integration challenges requires appropriate tools to develop necessary hardware and software components. Unfortunately, current specialized DRAM-testing platforms, or system simulators do not provide the flexibility and/or the holistic system view that is necessary to deal with PuM integration challenges.
We design and develop PiDRAM, the first flexible end-to-end framework that enables system integration studies and evaluation of real PuM techniques. PiDRAM provides software and hardware components to rapidly integrate PuM techniques across the whole system software and hardware stack (e.g., necessary modifications in the operating system, memory controller). We implement PiDRAM on an FPGA-based platform along with an open-source RISC-V system. Using PiDRAM, we implement and evaluate two state-of-the-art PuM techniques: in-DRAM (i) copy and initialization, (ii) true random number generation. Our results show that the in-memory copy and initialization techniques can improve the performance of bulk copy operations by 12.6x and bulk initialization operations by 14.6x on a real system. Implementing the true random number generator requires only 190 lines of Verilog and 74 lines of C code using PiDRAM's software and hardware components.
△ Less
Submitted 4 September, 2023; v1 submitted 29 October, 2021;
originally announced November 2021.
-
MoRS: An Approximate Fault Modelling Framework for Reduced-Voltage SRAMs
Authors:
İsmail Emir Yüksel,
Behzad Salami,
Oğuz Ergin,
Osman Sabri Ünsal,
Adrian Cristal Kestelman
Abstract:
On-chip memory (usually based on Static RAMs-SRAMs) are crucial components for various computing devices including heterogeneous devices, e.g., GPUs, FPGAs, ASICs to achieve high performance. Modern workloads such as Deep Neural Networks (DNNs) running on these heterogeneous fabrics are highly dependent on the on-chip memory architecture for efficient acceleration. Hence, improving the energy-effi…
▽ More
On-chip memory (usually based on Static RAMs-SRAMs) are crucial components for various computing devices including heterogeneous devices, e.g., GPUs, FPGAs, ASICs to achieve high performance. Modern workloads such as Deep Neural Networks (DNNs) running on these heterogeneous fabrics are highly dependent on the on-chip memory architecture for efficient acceleration. Hence, improving the energy-efficiency of such memories directly leads to an efficient system. One of the common methods to save energy is undervolting i.e., supply voltage underscaling below the nominal level. Such systems can be safely undervolted without incurring faults down to a certain voltage limit. This safe range is also called voltage guardband. However, reducing voltage below the guardband level without decreasing frequency causes timing-based faults.
In this paper, we propose MoRS, a framework that generates the first approximate undervolting fault model using real faults extracted from experimental undervolting studies on SRAMs to build the model. We inject the faults generated by MoRS into the on-chip memory of the DNN accelerator to evaluate the resilience of the system under the test. MoRS has the advantage of simplicity without any need for high-time overhead experiments while being accurate enough in comparison to a fully randomly-generated fault injection approach. We evaluate our experiment in popular DNN workloads by map** weights to SRAMs and measure the accuracy difference between the output of the MoRS and the real data. Our results show that the maximum difference between real fault data and the output fault model of MoRS is 6.21%, whereas the maximum difference between real data and random fault injection model is 23.2%. In terms of average proximity to the real data, the output of MoRS outperforms the random fault injection approach by 3.21x.
△ Less
Submitted 19 July, 2022; v1 submitted 12 October, 2021;
originally announced October 2021.
-
On the Impact of Device-Level Techniques on Energy-Efficiency of Neural Network Accelerators
Authors:
Seyed Morteza Nabavinejad,
Behzad Salami
Abstract:
Energy-efficiency is a key concern for neural network applications. To alleviate this issue, hardware acceleration using FPGAs or GPUs can provide better energy-efficiency than general-purpose processors. However, further improvement of the energy-efficiency of such accelerators will be extremely beneficial specially to deploy neural network in power-constrained edge computing environments. In thi…
▽ More
Energy-efficiency is a key concern for neural network applications. To alleviate this issue, hardware acceleration using FPGAs or GPUs can provide better energy-efficiency than general-purpose processors. However, further improvement of the energy-efficiency of such accelerators will be extremely beneficial specially to deploy neural network in power-constrained edge computing environments. In this paper, we experimentally explore the potential of device-level energy-efficiency techniques (e.g.,supply voltage underscaling, frequency scaling, and data quantization) for representative off-the-shelf FPGAs compared to GPUs. Frequency scaling in both platforms can improve the power and energy consumption but with performance overhead, e.g.,in GPUs it improves the power consumption and GOPs/J by up to 34% and 28%, respectively. However, leveraging reduced-precision instructions improves power (up to 13%), energy (up to 20%), and performance (up to 7%) simultaneously, with negligible reduction in accuracy of neural network accuracy.
△ Less
Submitted 26 June, 2021;
originally announced June 2021.
-
Understanding Power Consumption and Reliability of High-Bandwidth Memory with Voltage Underscaling
Authors:
Seyed Saber Nabavi Larimi,
Behzad Salami,
Osman S. Unsal,
Adrian Cristal Kestelman,
Hamid Sarbazi-Azad,
Onur Mutlu
Abstract:
Modern computing devices employ High-Bandwidth Memory (HBM) to meet their memory bandwidth requirements. An HBM-enabled device consists of multiple DRAM layers stacked on top of one another next to a compute chip (e.g. CPU, GPU, and FPGA) in the same package. Although such HBM structures provide high bandwidth at a small form factor, the stacked memory layers consume a substantial portion of the p…
▽ More
Modern computing devices employ High-Bandwidth Memory (HBM) to meet their memory bandwidth requirements. An HBM-enabled device consists of multiple DRAM layers stacked on top of one another next to a compute chip (e.g. CPU, GPU, and FPGA) in the same package. Although such HBM structures provide high bandwidth at a small form factor, the stacked memory layers consume a substantial portion of the package's power budget. Therefore, power-saving techniques that preserve the performance of HBM are desirable. Undervolting is one such technique: it reduces the supply voltage to decrease power consumption without reducing the device's operating frequency to avoid performance loss. Undervolting takes advantage of voltage guardbands put in place by manufacturers to ensure correct operation under all environmental conditions. However, reducing voltage without changing frequency can lead to reliability issues manifested as unwanted bit flips. In this paper, we provide the first experimental study of real HBM chips under reduced-voltage conditions. We show that the guardband regions for our HBM chips constitute 19% of the nominal voltage. Pushing the supply voltage down within the guardband region reduces power consumption by a factor of 1.5X for all bandwidth utilization rates. Pushing the voltage down further by 11% leads to a total of2.3X power savings at the cost of unwanted bit flips. We explore and characterize the rate and types of these reduced-voltage-induced bit flips and present a fault map that enables the possibility of a three-factor trade-off among power, memory capacity, and fault rate.
△ Less
Submitted 30 December, 2020;
originally announced January 2021.
-
Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins
Authors:
George Papadimitriou,
Athanasios Chatzidimitriou,
Dimitris Gizopoulos,
Vijay Janapa Reddi,
**gwen Leng,
Behzad Salami,
Osman S. Unsal,
Adrian Cristal Kestelman
Abstract:
Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of…
▽ More
Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of the most efficient methods to reduce power consumption of a chip. With the gallo** adoption of hardware accelerators (i.e., GPUs and FPGAs) in large datacenters and other large-scale computing infrastructures, a comprehensive evaluation of the safe voltage reduction levels for each different chip can be employed for efficient reduction of the total power. We present a survey of recent studies in voltage margins reduction at the system level for modern CPUs, GPUs and FPGAs. The pessimistic voltage guardbands inserted by the silicon vendors can be exploited in all devices for significant power savings. On average, voltage reduction can reach 12% in multicore CPUs, 20% in manycore GPUs and 39% in FPGAs.
△ Less
Submitted 1 June, 2020;
originally announced June 2020.
-
Power and Accuracy of Multi-Layer Perceptrons (MLPs) under Reduced-voltage FPGA BRAMs Operation
Authors:
Behzad Salami,
Osman Unsal,
Adrian Cristal
Abstract:
In this paper, we exploit the aggressive supply voltage underscaling technique in Block RAMs (BRAMs) of Field Programmable Gate Arrays (FPGAs) to improve the energy efficiency of Multi-Layer Perceptrons (MLPs). Additionally, we evaluate and improve the resilience of this accelerator. Through experiments on several representative FPGA fabrics, we observe that until a minimum safe voltage level, i.e…
▽ More
In this paper, we exploit the aggressive supply voltage underscaling technique in Block RAMs (BRAMs) of Field Programmable Gate Arrays (FPGAs) to improve the energy efficiency of Multi-Layer Perceptrons (MLPs). Additionally, we evaluate and improve the resilience of this accelerator. Through experiments on several representative FPGA fabrics, we observe that until a minimum safe voltage level, i.e., Vmin the MLP accuracy is not affected. This safe region involves a large voltage guardband. Also, it involves a narrower voltage region where faults start to appear in memories due to the increased circuit delay, but these faults are masked by MLP, and thus, its accuracy is not affected. However, further undervolting causes significant accuracy loss as a result of the fast-increasing high fault rates. Based on the characterization of these undervolting faults, we propose fault mitigation techniques that can effectively improve the resilience behavior of such accelerator. Our evaluation is based on four FPGA platforms. On average, we achieve >90% energy saving with a negligible accuracy loss of up to 0.1%.
△ Less
Submitted 10 May, 2020;
originally announced May 2020.
-
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
Authors:
Behzad Salami,
Erhan Baturay Onural,
Ismail Emir Yuksel,
Fahrettin Koc,
Oguz Ergin,
Adrian Cristal Kestelman,
Osman S. Unsal,
Hamid Sarbazi-Azad,
Onur Mutlu
Abstract:
We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power tr…
▽ More
We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.
△ Less
Submitted 30 December, 2020; v1 submitted 4 May, 2020;
originally announced May 2020.
-
On the Resilience of Deep Learning for Reduced-voltage FPGAs
Authors:
Kamyar Givaki,
Behzad Salami,
Reza Hojabr,
S. M. Reza Tayaranian,
Ahmad Khonsari,
Dara Rahmati,
Saeid Gorgin,
Adrian Cristal,
Osman S. Unsal
Abstract:
Deep Neural Networks (DNNs) are inherently computation-intensive and also power-hungry. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) are a promising solution that can satisfy these requirements for both embedded and High-Performance Computing (HPC) systems. In FPGAs, as well as CPUs and GPUs, aggressive voltage scaling below the nominal level is an effective technique for p…
▽ More
Deep Neural Networks (DNNs) are inherently computation-intensive and also power-hungry. Hardware accelerators such as Field Programmable Gate Arrays (FPGAs) are a promising solution that can satisfy these requirements for both embedded and High-Performance Computing (HPC) systems. In FPGAs, as well as CPUs and GPUs, aggressive voltage scaling below the nominal level is an effective technique for power dissipation minimization. Unfortunately, bit-flip faults start to appear as the voltage is scaled down closer to the transistor threshold due to timing issues, thus creating a resilience issue.
This paper experimentally evaluates the resilience of the training phase of DNNs in the presence of voltage underscaling related faults of FPGAs, especially in on-chip memories. Toward this goal, we have experimentally evaluated the resilience of LeNet-5 and also a specially designed network for CIFAR-10 dataset with different activation functions of Rectified Linear Unit (Relu) and Hyperbolic Tangent (Tanh). We have found that modern FPGAs are robust enough in extremely low-voltage levels and that low-voltage related faults can be automatically masked within the training iterations, so there is no need for costly software- or hardware-oriented fault mitigation techniques like ECC. Approximately 10% more training iterations are needed to fill the gap in the accuracy. This observation is the result of the relatively low rate of undervolting faults, i.e., <0.1\%, measured on real FPGA fabrics. We have also increased the fault rate significantly for the LeNet-5 network by randomly generated fault injection campaigns and observed that the training accuracy starts to degrade. When the fault rate increases, the network with Tanh activation function outperforms the one with Relu in terms of accuracy, e.g., when the fault rate is 30% the accuracy difference is 4.92%.
△ Less
Submitted 26 December, 2019;
originally announced January 2020.
-
LEGaTO: Low-Energy, Secure, and Resilient Toolset for Heterogeneous Computing
Authors:
B. Salami,
K. Parasyris,
A. Cristal,
O. Unsal,
X. Martorell,
P. Carpenter,
R. De La Cruz,
L. Bautista,
D. Jimenez,
C. Alvarez,
S. Nabavi,
S. Madonar,
M. Pericas,
P. Trancoso,
M. Abduljabbar,
J. Chen,
P. N. Soomro,
M Manivannan,
M. Berge,
S. Krupop,
F. Klawonn,
Al Mekhlafi,
S. May,
T. Becker,
G. Gaydadjiev
, et al. (20 additional authors not shown)
Abstract:
The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in…
▽ More
The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in December 2017.
△ Less
Submitted 1 December, 2019;
originally announced December 2019.
-
A Novel FPGA-Based High Throughput Accelerator For Binary Search Trees
Authors:
Oyku Melikoglu,
Oguz Ergin,
Behzad Salami,
Julian Pavon,
Osman Unsal,
Adrian Cristal
Abstract:
This paper presents a deeply pipelined and massively parallel Binary Search Tree (BST) accelerator for Field Programmable Gate Arrays (FPGAs). Our design relies on the extremely parallel on-chip memory, or Block RAMs (BRAMs) architecture of FPGAs. To achieve significant throughput for the search operation on BST, we present several novel mechanisms including tree duplication as well as horizontal,…
▽ More
This paper presents a deeply pipelined and massively parallel Binary Search Tree (BST) accelerator for Field Programmable Gate Arrays (FPGAs). Our design relies on the extremely parallel on-chip memory, or Block RAMs (BRAMs) architecture of FPGAs. To achieve significant throughput for the search operation on BST, we present several novel mechanisms including tree duplication as well as horizontal, duplicated, and hybrid (horizontal-vertical) tree partitioning. Also, we present efficient techniques to decrease the stalling rates that can occur during the parallel tree search. By combining these techniques and implementations on Xilinx Virtex-7 VC709 platform, we achieve up to 8X throughput improvement gain in comparison to the baseline implementation, i.e., a fully-pipelined FPGA-based accelerator.
△ Less
Submitted 1 December, 2019;
originally announced December 2019.
-
Hardware Versus Software Fault Injection of Modern Undervolted SRAMs
Authors:
Muhammet Abdullah Soyturk,
Konstantinos Parasyris,
Behzad Salami,
Osman Unsal,
Gulay Yalcin,
Leonardo Bautista Gomez
Abstract:
To improve power efficiency, researchers are experimenting with dynamically adjusting the supply voltage of systems below the nominal operating points. However, production systems are typically not allowed to function on voltage settings that is below the reliable limit. Consequently, existing software fault tolerance studies are based on fault models, which inject faults on random fault locations…
▽ More
To improve power efficiency, researchers are experimenting with dynamically adjusting the supply voltage of systems below the nominal operating points. However, production systems are typically not allowed to function on voltage settings that is below the reliable limit. Consequently, existing software fault tolerance studies are based on fault models, which inject faults on random fault locations using fault injection techniques. In this work we study whether random fault injection is accurate to simulate the behavior of undervolted SRAMs.
Our study extends the Gem5 simulator to support fault injection on the caches of the simulated system. The fault injection framework uses fault maps, which describe the faulty bits of SRAMs, as inputs. To compare random fault injection and hardware guided fault injection, we use two types of fault maps. The first type of maps are created through undervolting real SRAMs and observing the location of the erroneous bits, whereas the second type of maps are created by corrupting random bits of the SRAMs. During our study we corrupt the L1-Dcache of the simulated system and we monitor the behavior of the two types of fault maps on the resiliency of six benchmarks. The difference among the resiliency of a benchmark when tested with the different fault maps can be up to 24%.
△ Less
Submitted 30 November, 2019;
originally announced December 2019.
-
Evaluating Built-in ECC of FPGA on-chip Memories for the Mitigation of Undervolting Faults
Authors:
Behzad Salami,
Osman S. Unsal,
Adrian Cristal Kestelman
Abstract:
Voltage underscaling below the nominal level is an effective solution for improving energy efficiency in digital circuits, e.g., Field Programmable Gate Arrays (FPGAs). However, further undervolting below a safe voltage level and without accompanying frequency scaling leads to timing related faults, potentially undermining the energy savings. Through experimental voltage underscaling studies on co…
▽ More
Voltage underscaling below the nominal level is an effective solution for improving energy efficiency in digital circuits, e.g., Field Programmable Gate Arrays (FPGAs). However, further undervolting below a safe voltage level and without accompanying frequency scaling leads to timing related faults, potentially undermining the energy savings. Through experimental voltage underscaling studies on commercial FPGAs, we observed that the rate of these faults exponentially increases for on-chip memories, or Block RAMs (BRAMs). To mitigate these faults, we evaluated the efficiency of the built-in Error-Correction Code (ECC) and observed that more than 90% of the faults are correctable and further 7% are detectable (but not correctable). This efficiency is the result of the single-bit type of these faults, which are then effectively covered by the Single-Error Correction and Double-Error Detection (SECDED) design of the built-in ECC. Finally, motivated by the above experimental observations, we evaluated an FPGA-based Neural Network (NN) accelerator under low-voltage operations, while built-in ECC is leveraged to mitigate undervolting faults and thus, prevent NN significant accuracy loss. In consequence, we achieve 40% of the BRAM power saving through undervolting below the minimum safe voltage level, with a negligible NN accuracy loss, thanks to the substantial fault coverage by the built-in ECC.
△ Less
Submitted 29 March, 2019;
originally announced March 2019.
-
On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation
Authors:
Behzad Salami,
Osman Unsal,
Adrian Cristal
Abstract:
Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptibl…
▽ More
Machine Learning (ML) is making a strong resurgence in tune with the massive generation of unstructured data which in turn requires massive computational resources. Due to the inherently compute- and power-intensive structure of Neural Networks (NNs), hardware accelerators emerge as a promising solution. However, with technology node scaling below 10nm, hardware accelerators become more susceptible to faults, which in turn can impact the NN accuracy. In this paper, we study the resilience aspects of Register-Transfer Level (RTL) model of NN accelerators, in particular, fault characterization and mitigation. By following a High-Level Synthesis (HLS) approach, first, we characterize the vulnerability of various components of RTL NN. We observed that the severity of faults depends on both i) application-level specifications, i.e., NN data (inputs, weights, or intermediate), NN layers, and NN activation functions, and ii) architectural-level specifications, i.e., data representation model and the parallelism degree of the underlying accelerator. Second, motivated by characterization results, we present a low-overhead fault mitigation technique that can efficiently correct bit flips, by 47.3% better than state-of-the-art methods.
△ Less
Submitted 14 June, 2018;
originally announced June 2018.