Search | arXiv e-print repository

arXiv:2406.19522 [pdf, other]

Reliable edge machine learning hardware for scientific applications

Authors: Tommaso Baldi, Javier Campos, Ben Hawks, Jennifer Ngadiuba, Nhan Tran, Daniel Diaz, Javier Duarte, Ryan Kastner, Andres Meza, Melissa Quinnan, Olivia Weng, Caleb Geniesse, Amir Gholami, Michael W. Mahoney, Vladimir Loncar, Philip Harris, Joshua Agar, Shuyu Qin

Abstract: Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling… ▽ More Extreme data rate scientific experiments create massive amounts of data that require efficient ML edge processing. This leads to unique validation challenges for VLSI implementations of ML algorithms: enabling bit-accurate functional simulations for performance validation in experimental software frameworks, verifying those ML models are robust under extreme quantization and pruning, and enabling ultra-fine-grained model inspection for efficient fault tolerance. We discuss approaches to develo** and validating reliable algorithms at the scientific edge under such strict latency, resource, power, and area requirements in extreme experimental environments. We study metrics for develo** robust algorithms, present preliminary results and mitigation strategies, and conclude with an outlook of these and future directions of research towards the longer-term goal of develo** autonomous scientific experimentation methods for accelerated scientific discovery. △ Less

Submitted 27 June, 2024; originally announced June 2024.

Comments: IEEE VLSI Test Symposium 2024 (VTS)

Report number: FERMILAB-CONF-24-0116-CSAID

arXiv:2403.08980 [pdf, other]

Architectural Implications of Neural Network Inference for High Data-Rate, Low-Latency Scientific Applications

Authors: Olivia Weng, Alexander Redding, Nhan Tran, Javier Mauricio Duarte, Ryan Kastner

Abstract: With more scientific fields relying on neural networks (NNs) to process data incoming at extreme throughputs and latencies, it is crucial to develop NNs with all their parameters stored on-chip. In many of these applications, there is not enough time to go off-chip and retrieve weights. Even more so, off-chip memory such as DRAM does not have the bandwidth required to process these NNs as fast as… ▽ More With more scientific fields relying on neural networks (NNs) to process data incoming at extreme throughputs and latencies, it is crucial to develop NNs with all their parameters stored on-chip. In many of these applications, there is not enough time to go off-chip and retrieve weights. Even more so, off-chip memory such as DRAM does not have the bandwidth required to process these NNs as fast as the data is being produced (e.g., every 25 ns). As such, these extreme latency and bandwidth requirements have architectural implications for the hardware intended to run these NNs: 1) all NN parameters must fit on-chip, and 2) codesigning custom/reconfigurable logic is often required to meet these latency and bandwidth constraints. In our work, we show that many scientific NN applications must run fully on chip, in the extreme case requiring a custom chip to meet such stringent constraints. △ Less

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2401.15639 [pdf, other]

TOP: Towards Open & Predictable Heterogeneous SoCs

Authors: Luca Valente, Francesco Restuccia, Davide Rossi, Ryan Kastner, Luca Benini

Abstract: Ensuring predictability in modern real-time Systems-on-Chip (SoCs) is an increasingly critical concern for many application domains such as automotive, robotics, and industrial automation. An effective approach involves the modeling and development of hardware components, such as interconnects and shared memory resources, to evaluate or enforce their deterministic behavior. Unfortunately, these IP… ▽ More Ensuring predictability in modern real-time Systems-on-Chip (SoCs) is an increasingly critical concern for many application domains such as automotive, robotics, and industrial automation. An effective approach involves the modeling and development of hardware components, such as interconnects and shared memory resources, to evaluate or enforce their deterministic behavior. Unfortunately, these IPs are often closed-source, and these studies are limited to the single modules that must later be integrated with third-party IPs in more complex SoCs, hindering the precision and scope of modeling and compromising the overall predictability. With the coming-of-age of open-source instruction set architectures (RISC-V) and hardware, major opportunities for changing this status quo are emerging. This study introduces an innovative methodology for modeling and analyzing State-of-the-Art (SoA) open-source SoCs for low-power cyber-physical systems. Our approach models and analyzes the entire set of open-source IPs within these SoCs and then provides a comprehensive analysis of the entire architecture. We validate this methodology on a sample heterogenous low-power RISC-V architecture through RTL simulation and FPGA implementation, minimizing pessimism in bounding the service time of transactions crossing the architecture between 28% and 1%, which is considerably lower when compared to similar SoA works. △ Less

Submitted 7 June, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

arXiv:2304.08263 [pdf, other]

Information Flow Coverage Metrics for Hardware Security Verification

Authors: Andres Meza, Ryan Kastner

Abstract: Security graphs model attacks, defenses, mitigations, and vulnerabilities on computer networks and systems. With proper attributes, they provide security metrics using standard graph algorithms. A hyperflow graph is a register-transfer level (RTL) hardware security graph that facilitates security verification. A hyperflow graph models information flows and is annotated with attributes that allow s… ▽ More Security graphs model attacks, defenses, mitigations, and vulnerabilities on computer networks and systems. With proper attributes, they provide security metrics using standard graph algorithms. A hyperflow graph is a register-transfer level (RTL) hardware security graph that facilitates security verification. A hyperflow graph models information flows and is annotated with attributes that allow security metrics to measure flow paths, flow conditions, and flow rates. Hyperflow graphs enable the understanding of hardware vulnerabilities related to confidentiality, integrity, and availability, as shown on the OpenTitan hardware root of trust under several threat models. △ Less

Submitted 12 April, 2023; originally announced April 2023.

Comments: 6 pages, 3 Figures

arXiv:2304.07543 [pdf, other]

Within-Camera Multilayer Perceptron DVS Denoising

Authors: A. Rios-Navarro, S. Guo, G Abarajithan, K. Vijayakumar, A. Linares-Barranco, T. Aarrestad, R. Kastner, T. Delbruck

Abstract: In-camera event denoising reduces the data rate of event cameras by filtering out noise at the source. A lightweight multilayer perceptron denoising filter (MLPF) provides state-of-the-art low-cost denoising accuracy. It processes a small neighborhood of pixels from the timestamp image around each event to discriminate signal and noise events. This paper proposes two digital logic implementations… ▽ More In-camera event denoising reduces the data rate of event cameras by filtering out noise at the source. A lightweight multilayer perceptron denoising filter (MLPF) provides state-of-the-art low-cost denoising accuracy. It processes a small neighborhood of pixels from the timestamp image around each event to discriminate signal and noise events. This paper proposes two digital logic implementations of the MLPF denoiser and quantifies their resource cost, power, and latency. The hardware MLPF quantizes the weights and hidden unit activations to 4 bits and has about 1k weights with about 40% sparsity. The Area-Under-Curve Receiver Operating Characteristic accuracy is nearly indistinguishable from that of the floating point network. The FPGA MLPF processes each event in 10 clock cycles. In FPGA, it uses 3.5k flip flops and 11.5k LUTs. Our ASIC implementation in 65nm digital technology for a 346x260 pixel camera occupies an area of 4.3mm^2 and consumes 4nJ of energy per event at event rates up to 25MHz. The MLPF can be easily integrated into an event camera using an FPGA or as an ASIC directly on the camera chip or in the same package. This denoising could dramatically reduce the energy consumed by the communication and host processor and open new areas of always-on event camera application under scavenged and battery power. Code: https://github.com/SensorsINI/dnd_hls △ Less

Submitted 15 April, 2023; originally announced April 2023.

Comments: Accepted to 2023 CVPRW Workshop on Event-Based Vision

arXiv:2303.17881 [pdf, other]

Pentimento: Data Remanence in Cloud FPGAs

Authors: Colin Drewes, Olivia Weng, Andres Meza, Alric Althoff, David Kohlbrenner, Ryan Kastner, Dustin Richmond

Abstract: Cloud FPGAs strike an alluring balance between computational efficiency, energy efficiency, and cost. It is the flexibility of the FPGA architecture that enables these benefits, but that very same flexibility that exposes new security vulnerabilities. We show that a remote attacker can recover "FPGA pentimenti" - long-removed secret data belonging to a prior user of a cloud FPGA. The sensitive dat… ▽ More Cloud FPGAs strike an alluring balance between computational efficiency, energy efficiency, and cost. It is the flexibility of the FPGA architecture that enables these benefits, but that very same flexibility that exposes new security vulnerabilities. We show that a remote attacker can recover "FPGA pentimenti" - long-removed secret data belonging to a prior user of a cloud FPGA. The sensitive data constituting an FPGA pentimento is an analog imprint from bias temperature instability (BTI) effects on the underlying transistors. We demonstrate how this slight degradation can be measured using a time-to-digital (TDC) converter when an adversary programs one into the target cloud FPGA. This technique allows an attacker to ascertain previously safe information on cloud FPGAs, even after it is no longer explicitly present. Notably, it can allow an attacker who knows a non-secret "skeleton" (the physical structure, but not the contents) of the victim's design to (1) extract proprietary details from an encrypted FPGA design image available on the AWS marketplace and (2) recover data loaded at runtime by a previous user of a cloud FPGA using a known design. Our experiments show that BTI degradation (burn-in) and recovery are measurable and constitute a security threat to commercial cloud FPGAs. △ Less

Submitted 31 March, 2023; originally announced March 2023.

Comments: 17 Pages, 8 Figures

arXiv:2301.07247 [pdf, other]

Tailor: Altering Skip Connections for Resource-Efficient Inference

Authors: Olivia Weng, Gabriel Marcano, Vladimir Loncar, Alireza Khodamoradi, Nojan Sheybani, Andres Meza, Farinaz Koushanfar, Kristof Denolf, Javier Mauricio Duarte, Ryan Kastner

Abstract: Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network's sk… ▽ More Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network's skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network's skip connections to lower their hardware cost. Tailor improves resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs for on-chip, dataflow-style architectures. Tailor increases performance by 30% and reduces memory bandwidth by 45% for a 2D processing element array architecture. △ Less

Submitted 15 September, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

arXiv:2206.11791 [pdf, other]

Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark

Authors: Hendrik Borras, Giuseppe Di Guglielmo, Javier Duarte, Nicolò Ghielmetti, Ben Hawks, Scott Hauck, Shih-Chieh Hsu, Ryan Kastner, Jason Liang, Andres Meza, Jules Muhizi, Tai Nguyen, Rushil Roy, Nhan Tran, Yaman Umuroglu, Olivia Weng, Aidan Yokuda, Michaela Blott

Abstract: We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classificatio… ▽ More We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are quantized, configurable, spatial dataflow architectures tailored for speed and efficiency and introduce new generic optimizations and common workflows developed as a part of this work. The full workflow is presented from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 $μ$s and energy consumption as low as 30 $μ$J per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools. △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: 15 pages, 7 figures, Contribution to 3rd Workshop on Benchmarking Machine Learning Workloads on Emerging Hardware (MLBench) at 5th Conference on Machine Learning and Systems (MLSys)

Report number: FERMILAB-CONF-22-479-SCD

arXiv:2110.13041 [pdf, other]

doi 10.3389/fdata.2022.787421

Applications and Techniques for Fast Machine Learning in Science

Authors: Allison McCarn Deiana, Nhan Tran, Joshua Agar, Michaela Blott, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Scott Hauck, Mia Liu, Mark S. Neubauer, Jennifer Ngadiuba, Seda Ogrenci-Memik, Maurizio Pierini, Thea Aarrestad, Steffen Bahr, Jurgen Becker, Anne-Sophie Berthold, Richard J. Bonventre, Tomas E. Muller Bravo, Markus Diefenthaler, Zhen Dong, Nick Fritzsche, Amir Gholami, Ekaterina Govorkova, Kyle J Hazelwood , et al. (62 additional authors not shown)

Abstract: In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating power ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML ac… ▽ More In this community review report, we discuss applications and techniques for fast machine learning (ML) in science -- the concept of integrating power ML methods into the real-time experimental data processing loop to accelerate scientific discovery. The material for the report builds on two workshops held by the Fast ML for Science community and covers three main areas: applications for fast ML across a number of scientific domains; techniques for training and implementing performant and resource-efficient ML algorithms; and computing architectures, platforms, and technologies for deploying these algorithms. We also present overlap** challenges across the multiple scientific domains where common solutions can be found. This community report is intended to give plenty of examples and inspiration for scientific discovery through integrated and accelerated ML solutions. This is followed by a high-level overview and organization of technical advances, including an abundance of pointers to source material, which can enable these breakthroughs. △ Less

Submitted 25 October, 2021; originally announced October 2021.

Comments: 66 pages, 13 figures, 5 tables

Report number: FERMILAB-PUB-21-502-AD-E-SCD

Journal ref: Front. Big Data 5, 787421 (2022)

arXiv:2110.06870 [pdf, ps, other]

Junkyard Computing: Repurposing Discarded Smartphones to Minimize Carbon

Authors: Jennifer Switzer, Gabriel Marcano, Ryan Kastner, Pat Pannuto

Abstract: 1.5 billion smartphones are sold annually, and most are decommissioned less than two years later. Most of these unwanted smartphones are neither discarded nor recycled but languish in junk drawers and storage units. This computational stockpile represents a substantial wasted potential: modern smartphones have increasingly high-performance and energy-efficient processors, extensive networking capa… ▽ More 1.5 billion smartphones are sold annually, and most are decommissioned less than two years later. Most of these unwanted smartphones are neither discarded nor recycled but languish in junk drawers and storage units. This computational stockpile represents a substantial wasted potential: modern smartphones have increasingly high-performance and energy-efficient processors, extensive networking capabilities, and a reliable built-in power supply. This project studies the ability to reuse smartphones as "junkyard computers." Junkyard computers grow global computing capacity by extending device lifetimes, which supplants the manufacture of new devices. We show that the capabilities of even decade-old smartphones are within those demanded by modern cloud microservices and discuss how to combine phones to perform increasingly complex tasks. We describe how current operation-focused metrics do not capture the actual carbon costs of compute. We propose Computational Carbon Intensity -- a performance metric that balances the continued service of older devices with the superlinear runtime improvements of newer machines. We use this metric to redefine device service lifetime in terms of carbon efficiency. We develop a cloudlet of reused Pixel 3A phones. We analyze the carbon benefits of deploying large, end-to-end microservice-based applications on these smartphones. Finally, we describe system architectures and associated challenges to scale to cloudlets with hundreds and thousands of smartphones. △ Less

Submitted 25 October, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

arXiv:2106.13263 [pdf, other]

AKER: A Design and Verification Framework for Safe andSecure SoC Access Control

Authors: Francesco Restuccia, Andres Meza, Ryan Kastner

Abstract: Modern systems on a chip (SoCs) utilize heterogeneous architectures where multiple IP cores have concurrent access to on-chip shared resources. In security-critical applications, IP cores have different privilege levels for accessing shared resources, which must be regulated by an access control system. AKER is a design and verification framework for SoC access control. AKER builds upon the Access… ▽ More Modern systems on a chip (SoCs) utilize heterogeneous architectures where multiple IP cores have concurrent access to on-chip shared resources. In security-critical applications, IP cores have different privilege levels for accessing shared resources, which must be regulated by an access control system. AKER is a design and verification framework for SoC access control. AKER builds upon the Access Control Wrapper (ACW) -- a high performance and easy-to-integrate hardware module that dynamically manages access to shared resources. To build an SoC access control system, AKER distributes the ACWs throughout the SoC, wrap** controller IP cores, and configuring the ACWs to perform local access control. To ensure the access control system is functioning correctly and securely, AKER provides a property-driven security verification using MITRE common weakness enumerations. AKER verifies the SoC access control at the IP level to ensure the absence of bugs in the functionalities of the ACW module, at the firmware level to confirm the secure operation of the ACW when integrated with a hardware root-of-trust (HRoT), and at the system level to evaluate security threats due to the interactions among shared resources. The performance, resource usage, and security of access control systems implemented through AKER is experimentally evaluated on a Xilinx UltraScale+ programmable SoC, it is integrated with the OpenTitan hardware root-of-trust, and it is used to design an access control system for the OpenPULP multicore architecture. △ Less

Submitted 24 June, 2021; originally announced June 2021.

arXiv:2106.07449 [pdf, other]

doi 10.1145/3474376.3487286

Isadora: Automated Information Flow Property Generation for Hardware Designs

Authors: Calvin Deutschbein, Andres Meza, Francesco Restuccia, Ryan Kastner, Cynthia Sturton

Abstract: Isadora is a methodology for creating information flow specifications of hardware designs. The methodology combines information flow tracking and specification mining to produce a set of information flow properties that are suitable for use during the security validation process, and which support a better understanding of the security posture of the design. Isadora is fully automated; the user pr… ▽ More Isadora is a methodology for creating information flow specifications of hardware designs. The methodology combines information flow tracking and specification mining to produce a set of information flow properties that are suitable for use during the security validation process, and which support a better understanding of the security posture of the design. Isadora is fully automated; the user provides only the design under consideration and a testbench and need not supply a threat model nor security specifications. We evaluate Isadora on a RISC-V processor plus two designs related to SoC access control. Isadora generates security properties that align with those suggested by the Common Weakness Enumerations (CWEs), and in the case of the SoC designs, align with the properties written manually by security experts. △ Less

Submitted 2 October, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

Comments: 10 pages, 4 figures, accepted at ASHES 2021

arXiv:2102.01351 [pdf]

Hardware-efficient Residual Networks for FPGAs

Authors: Olivia Weng, Alireza Khodamoradi, Ryan Kastner

Abstract: Residual networks (ResNets) employ skip connections in their networks -- reusing activations from previous layers -- to improve training convergence, but these skip connections create challenges for hardware implementations of ResNets. The hardware must either wait for skip connections to be processed before processing more incoming data or buffer them elsewhere. Without skip connections, ResNets… ▽ More Residual networks (ResNets) employ skip connections in their networks -- reusing activations from previous layers -- to improve training convergence, but these skip connections create challenges for hardware implementations of ResNets. The hardware must either wait for skip connections to be processed before processing more incoming data or buffer them elsewhere. Without skip connections, ResNets would be more hardware-efficient. Thus, we present the teacher-student learning method to gradually prune away all of a ResNet's skip connections, constructing a network we call NonResNet. We show that when implemented for FPGAs, NonResNet decreases ResNet's BRAM utilization by 9% and LUT utilization by 3% and increases throughput by 5%. △ Less

Submitted 2 February, 2021; originally announced February 2021.

Comments: Presented at DATE Friday Workshop on System-level Design Methods for Deep Learning on Heterogeneous Architectures (SLOHA 2021) (arXiv:2102.00818)

Report number: SLOHA/2021/10

arXiv:2012.02791 [pdf, other]

A Unified Model for Gate Level Propagation Analysis

Authors: Jeremy Blackstone, Wei Hu, Alric Althoff, Armaiti Ardeshiricham, Lu Zhang, Ryan Kastner

Abstract: Classic hardware verification techniques (e.g., X-propagation and fault-propagation) and more recent hardware security verification techniques based on information flow tracking (IFT) aim to understand how information passes, affects, and otherwise modifies a circuit. These techniques all have separate usage scenarios, but when dissected into their core functionality, they relate in a fundamental… ▽ More Classic hardware verification techniques (e.g., X-propagation and fault-propagation) and more recent hardware security verification techniques based on information flow tracking (IFT) aim to understand how information passes, affects, and otherwise modifies a circuit. These techniques all have separate usage scenarios, but when dissected into their core functionality, they relate in a fundamental manner. In this paper, we develop a common framework for gate level propagation analysis. We use our model to generate synthesizable propagation logic to use in standard EDA tools. To justify our model, we prove that Precise Hardware IFT is equivalent to gate level X-propagation and imprecise fault propagation. We also show that the difference between Precise Hardware IFT and fault propagation is not significant for 74X-series and '85 ISCAS benchmarks with more than 313 gates and the difference between imprecise hardware IFT and Precise Hardware IFT is almost always significant regardless of size. △ Less

Submitted 7 December, 2020; originally announced December 2020.

arXiv:2002.04971 [pdf, other]

doi 10.1109/ICCAD45719.2019.8942122

FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA

Authors: Shehzeen Hussain, Mojan Javaheripi, Paarth Neekhara, Ryan Kastner, Farinaz Koushanfar

Abstract: Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution that is used for sequence generation. While WaveNet produces state-of-the art audio generation results, the naive inferen… ▽ More Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution that is used for sequence generation. While WaveNet produces state-of-the art audio generation results, the naive inference implementation is quite slow; it takes a few minutes to generate just one second of audio on a high-end GPU. In this work, we develop the first accelerator platform~\textit{FastWave} for autoregressive convolutional neural networks, and address the associated design challenges. We design the Fast-Wavenet inference model in Vivado HLS and perform a wide range of optimizations including fixed-point implementation, array partitioning and pipelining. Our model uses a fully parameterized parallel architecture for fast matrix-vector multiplication that enables per-layer customized latency fine-tuning for further throughput improvement. Our experiments comparatively assess the trade-off between throughput and resource utilization for various optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA that uses only on-chip memory, achieves 66 faster generation speed compared to CPU implementation and 11 faster generation speed than GPU implementation. △ Less

Submitted 9 February, 2020; originally announced February 2020.

Comments: Published as a conference paper at ICCAD 2019

Journal ref: @inproceedings {1143,booktitle = {IEEE/ACM 2019 International Conference On Computer Aided Design (ICCAD)},year = {2019},month = {November}}

arXiv:2001.10717 [pdf, other]

Patient Specific Biomechanics Are Clinically Significant In Accurate Computer Aided Surgical Image Guidance

Authors: Michael Barrow, Alice Chao, Qizhi He, Sonia Ramamoorthy, Claude Sirlin, Ryan Kastner

Abstract: Augmented Reality is used in Image Guided surgery (AR IG) to fuse surgical landmarks from preoperative images into a video overlay. Physical simulation is essential to maintaining accurate position of the landmarks as surgery progresses and ensuring patient safety by avoiding accidental damage to vessels etc. In liver procedures, AR IG simulation accuracy is hampered by an inability to model stiff… ▽ More Augmented Reality is used in Image Guided surgery (AR IG) to fuse surgical landmarks from preoperative images into a video overlay. Physical simulation is essential to maintaining accurate position of the landmarks as surgery progresses and ensuring patient safety by avoiding accidental damage to vessels etc. In liver procedures, AR IG simulation accuracy is hampered by an inability to model stiffness variations unique to the patients disease. We introduce a novel method to account for patient specific stiffness variation based on Magnetic Resonance Elastography (MRE) data. To the best of our knowledge we are the first to demonstrate the use of in-vivo biomechanical data for AR IG landmark placement. In this early work, a comparative evaluation of our MRE data driven simulation and the traditional method shows clinically significant differences in accuracy during landmark placement and motivates further animal model trials. △ Less

Submitted 29 January, 2020; originally announced January 2020.

Comments: 7 pages, 8 figures

arXiv:1909.00713 [pdf, other]

Estimation of Absolute Scale in Monocular SLAM Using Synthetic Data

Authors: Danila Rukhovich, Daniel Mouritzen, Ralf Kaestner, Martin Rufli, Alexander Velizhev

Abstract: This paper addresses the problem of scale estimation in monocular SLAM by estimating absolute distances between camera centers of consecutive image frames. These estimates would improve the overall performance of classical (not deep) SLAM systems and allow metric feature locations to be recovered from a single monocular camera. We propose several network architectures that lead to an improvement o… ▽ More This paper addresses the problem of scale estimation in monocular SLAM by estimating absolute distances between camera centers of consecutive image frames. These estimates would improve the overall performance of classical (not deep) SLAM systems and allow metric feature locations to be recovered from a single monocular camera. We propose several network architectures that lead to an improvement of scale estimation accuracy over the state of the art. In addition, we exploit a possibility to train the neural network only with synthetic data derived from a computer graphics simulator. Our key insight is that, using only synthetic training inputs, we can achieve similar scale estimation accuracy as that obtained from real data. This fact indicates that fully annotated simulated data is a viable alternative to existing deep-learning-based SLAM systems trained on real (unlabeled) data. Our experiments with unsupervised domain adaptation also show that the difference in visual appearance between simulated and real data does not affect scale estimation results. Our method operates with low-resolution images (0.03MP), which makes it practical for real-time SLAM applications with a monocular camera. △ Less

Submitted 2 September, 2019; originally announced September 2019.

arXiv:1805.03648 [pdf, other]

Parallel Programming for FPGAs

Authors: Ryan Kastner, Janarbek Matai, Stephen Neuendorffer

Abstract: This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems. Our goal is to give the reader an appreciation of the process of creating an optimized hardware design using HLS. Although the details are, of necessity, different from parallel programming for multicore processors or GPUs, many of the fundamental concepts are similar. For example, de… ▽ More This book focuses on the use of algorithmic high-level synthesis (HLS) to build application-specific FPGA systems. Our goal is to give the reader an appreciation of the process of creating an optimized hardware design using HLS. Although the details are, of necessity, different from parallel programming for multicore processors or GPUs, many of the fundamental concepts are similar. For example, designers must understand memory hierarchy and bandwidth, spatial and temporal locality of reference, parallelism, and tradeoffs between computation and storage. This book is a practical guide for anyone interested in building FPGA systems. In a university environment, it is appropriate for advanced undergraduate and graduate courses. At the same time, it is also useful for practicing system designers and embedded programmers. The book assumes the reader has a working knowledge of C/C++ and includes a significant amount of sample code. In addition, we assume familiarity with basic computer architecture concepts (pipelining, speedup, Amdahl's Law, etc.). A knowledge of the RTL-based FPGA design flow is helpful, although not required. △ Less

Submitted 9 May, 2018; originally announced May 2018.

arXiv:1408.5870 [pdf]

Enabling FPGAs for the Masses

Authors: Janarbek Matai, Dustin Richmond, Dajung Lee, Ryan Kastner

Abstract: Implementing an application on a FPGA remains a difficult, non-intuitive task that often requires hardware design expertise in a hardware description language (HDL). High-level synthesis (HLS) raises the design abstraction from HDL to languages such as C/C++/Scala/Java. Despite this, in order to get a good quality of result (QoR), a designer must carefully craft the HLS code. In other words, HLS d… ▽ More Implementing an application on a FPGA remains a difficult, non-intuitive task that often requires hardware design expertise in a hardware description language (HDL). High-level synthesis (HLS) raises the design abstraction from HDL to languages such as C/C++/Scala/Java. Despite this, in order to get a good quality of result (QoR), a designer must carefully craft the HLS code. In other words, HLS designers must implement the application using an abstract language in a manner that generates an efficient micro-architecture; we call this process writing restructured code. This reduces the benefits of implementing the application at a higher level of abstraction and limits the impact of HLS by requiring explicit knowledge of the underlying hardware architecture. Developers must know how to write code that reflects low level implementation details of the application at hand as it is interpreted by HLS tools. As a result, FPGA design still largely remains job of either hardware engineers or expert HLS designers. In this work, we aim to take a step towards making HLS tools useful for a broader set of programmers. To do this, we study methodologies of restructuring software code for HLS tools; we provide examples of designing different kernels in state-of-the art HLS tools; and we present a list of challenges for develo** a hardware programming model for software programmers. △ Less

Submitted 20 August, 2014; originally announced August 2014.

Comments: Presented at First International Workshop on FPGAs for Software Programmers (FSP 2014) (arXiv:1408.4423)

Report number: FSP/2014/03

Showing 1–19 of 19 results for author: Kastner, R